Abstract. We consider a continuous-time model for inventory management with Markov modulated
non-stationary demands. We introduce active learning by assuming that the state of the
world is unobserved and must be inferred by the manager. We also assume that demands are 
observed only when they are completely met. We first derive the explicit filtering equations and 
pass to an equivalent fully observed impulse control problem in terms of the sufficient statistics, 
the a posteriori probability process and the current inventory level. We then solve this equivalent 
formulation and directly characterize an optimal inventory policy. We also describe a computa- 
tional procedure to calculate the value function and the optimal policy and present two numerical 
illustrations. 

1. Introduction 

Inventory management aims to control the supply ordering of a firm so that inventory costs are 
minimized and a maximum of customer orders are filled. These are competing objectives since 
low inventory stock reduces storage and ordering costs, whereas high inventory avoids stock-outs. 
The problem is complicated by the fact that real-life demand is never stationary and therefore the 
inventory policy should be non-stationary as well. A popular method of addressing this issue is 
to introduce a regime-switching (or Markov-modulated) demand model. The regime is meant to 
represent the economic environment faced by the firm and drives the parameters (frequency and 
size distribution) of the actual demand process. However, typically this economic environment is 
unknown. From a modeling perspective, this leads to a partially observed hidden Markov model for 
demand. Thus, the inventory manager must simultaneously learn the current environment (based 
on incoming order information), and adjust her inventory policy accordingly in anticipation of 
future orders. 

The literature on inventory management with non-stationary Markovian demand originated with 
[21] who considered a continuous-time model where the demand levels or intensity are modulated
by the state of the world, which is assumed to be observable. The discrete-time counterpart of
this model was then analyzed by [28] who allowed a very general cost structure and proved a more
formal verification theorem for existence and regularity of the value function and existence of an 
optimal feedback policy. More recent work on fully observed non-stationary demand can be found
in pi]. 

Inventory management with partial information is a classical topic of operations research. In the 
simplest version as analyzed by [U |2S] and references therein, the demand distribution is unknown 
and must be learned by the controller. Thus, a finite-horizon discrete time parameter adaptive 
model is considered, so that the demand distribution is taken to be stationary and i.i.d., but with 
an unknown parameter. This parameter is then inferred over time using either an exponential 
smoothing or a Bayesian learning mechanism. In these papers the method of solution relied 
on the special structure of some particular cases (e.g. uniform demand level on [0, w], with w 
unknown) where a dimension reduction is possible, so that the learning update is simplified. Even 
after that, the problem remains extremely computationally challenging; accordingly the focus of 
[25] has been on studying approximate myopic or limited look-ahead policies. 

Another important strand of literature investigates the lost sales uncertainty. In that case, 
demand is observed only if it is completely met; instead of a back-order, unmet demand is lost. 
This then creates a partial information problem if demand levels are non-stationary. In particular, 
demand levels evolving according to a discrete-time Markov chain have been considered in the 
newsvendor context (completely perishable inventory) by [2H H3] 19], and in the general inventory
management case by [7]. [El [H]] have analyzed the related case whereby current inventory level
itself is uncertain. 

The main inspiration for our model is the work of [30] who considered a version of the [28] model
but under the assumption that the current world state is unknown. The controller must therefore
filter the present demand distribution to obtain information on the core process. [31] take a
partially observed Markov decision process (POMDP) formulation; since this is computationally
challenging, they focus on empirical study of approximate schemes, in particular myopic and 
limited look-ahead learning policies, as well as open-loop feedback learning. A related problem of 
dynamic pricing with unobserved Markov-modulated demand levels has been recently considered 
by [3]. In that paper, the authors also work in the POMDP framework and propose a different
approximation scheme based on the technique of information structure modification. 

In this paper we consider a continuous-time model for inventory management with Markov 
modulated non-stationary demands. We take the [29] model as the starting point and introduce 
active learning by assuming that the state of the core process is unobserved and must be inferred 
by the controller. We will also assume that the demand is observed only when it is completely 
met, otherwise censoring occurs. Our work extends the results of [9], [7] and [30] to the continuous-review
asynchronous setting. Use of a continuous- rather than discrete-time model facilitates some of
our analysis. It is also more realistic in medium-volume problems with many asynchronous orders
(e.g. computer hardware parts, industrial commodities, etc.), especially in applications with strong 
business cycles where demand distribution is highly non-stationary. In a continuous-time model,
the controller may adjust inventory immediately after a new order, but also between orders. This
is because the controller's beliefs about the demand environment are constantly evolving. Such 
qualitative distinction is not possible with discrete epochs where controls and demand observations 
are intrinsically paired. 

Our method of solution consists of two stages. In the first stage (Section 2), we derive explicit
filtering equations for the evolution of the conditional distribution of the core process. This is 
non-trivial in our model where censoring links the observed demand process with the chosen 
control strategy. We use the theory of partially observed point processes to extend earlier results 
of [21 [261 E] in Proposition 2.1. The piecewise deterministic strong Markov process obtained in
Proposition 2.1 then allows us to give a simplified and complete proof of the dynamic programming
equations, and describe an optimal policy (see Section 3). We achieve this by leveraging the general
probabilistic arguments for the impulse control of piecewise deterministic processes of [T71 [151 EE] 
and using direct arguments to establish necessary properties of the value function. Our approach is 
in contrast with the POMDP and quasi-variational formulations in the aforementioned literature 
that make use of more analytic tools. 

Our framework also leads to a different flavor for the numerical algorithm. The closed-form 
formulas obtained in Sections 2 and 3 permit us to give a direct and simple-to-implement
computational scheme that yields a precise solution. Thus, we do not employ any of the approximate
policies proposed in the literature, while maintaining a competitive computational complexity. 

To summarize, our contribution is a full and integrated analysis of this incomplete information
setting, including derivation of the filtering equations, characterization of an optimal policy and a
computationally efficient numerical algorithm. Our results show the feasibility of using continuous-time
partially observed models in inventory management problems. Moreover, our model allows for
many custom formulations, including arbitrary ordering/storage/stock-out/salvage cost structures,
different censoring formulations, perishable inventory and supply-size constraints. 

The rest of the paper is organized as follows. In the rest of the introduction we will give 
an informal description of the inventory management problem we are considering. In Section 2
we make the formulation more precise and show that the original problem is equivalent to a
fully observed impulse control problem described in terms of sufficient statistics. Moreover, we
characterize the evolution of the paths of the sufficient statistics. In Section 3 we show that the
value function is the unique fixed point of a functional operator (defined in terms of an optimal
stopping problem) and that it is continuous in all of its variables. Using the continuity of the
value function we describe an optimal policy in Section 3.2. In Section 3.3 we describe how to
compute the value function using an alternative characterization. Finally, in Section 4 we present
two numerical illustrations. Some of the longer proofs are left to the Appendix. 

1.1. Model Description. In this section we give an informal description of inventory
management with partial information and the objective of the controller. A rigorous construction is
in Section 2.

Let M be the unobservable Markov core process for the economic environment. We assume that
M is a continuous-time Markov chain with finite state space $E = \{1, \ldots, m\}$. The core process
M modulates customer orders modeled as a compound Poisson process X. More precisely, let
$Y_1, Y_2, \ldots$ be the consecutive order sizes, taking place at times $\sigma_1, \sigma_2, \ldots$. Define

(1.1) $N(t) = \sup\{k : \sigma_k \le t\}$

to be the number of orders received by time t, and

(1.2) $X_t = \sum_{k=1}^{N(t)} Y_k$

to be the total order size by time t. Then, the intensity of N (and X) is $\lambda_i$ whenever M is at
state i, for $i \in E$. Similarly, the distribution of $Y_k$ is $\nu_i$ conditional on $M_{\sigma_k} = i$. This structure is
illustrated schematically in Figure 1. The assumption of demand following a compound Poisson
process is standard in the OR literature, especially when considering large items (the case of
demand level following a jump-diffusion is investigated by [TT]).
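
The demand mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: regimes are indexed from 0 rather than 1, and the function name `simulate_demand` and its arguments are hypothetical, not notation from the paper. The sketch uses competing exponential clocks, so the next event is either a regime switch of M or an order arrival.

```python
import random


def simulate_demand(Q, lam, sizes, probs, m0, horizon, rng):
    """Simulate (arrival time, order size, regime) triples on [0, horizon].

    Q      : generator matrix of the regime chain M (rows sum to zero)
    lam    : lam[i] is the order intensity in regime i
    sizes  : possible order sizes, e.g. [1, 2, ..., R]
    probs  : probs[i][k] is the probability of sizes[k] in regime i
    m0     : initial regime (0-based index, an assumption of this sketch)
    """
    t, state, orders = 0.0, m0, []
    while True:
        leave_rate = -Q[state][state]
        total_rate = leave_rate + lam[state]
        # Time to the next event: minimum of the switch clock and the order clock.
        t += rng.expovariate(total_rate)
        if t > horizon:
            return orders
        if rng.random() < lam[state] / total_rate:
            # Order arrival: draw the size from the regime-specific distribution.
            y = rng.choices(sizes, weights=probs[state])[0]
            orders.append((t, y, state))
        else:
            # Regime switch: move to state j with probability q_ij / (-q_ii).
            others = [j for j in range(len(lam)) if j != state]
            weights = [Q[state][j] for j in others]
            state = rng.choices(others, weights=weights)[0]
```

A call such as `simulate_demand(Q, lam, sizes, probs, 0, 10.0, random.Random(1))` returns the full order history; the controller in the model would of course see only a censored version of it, and never the regime column.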

We assume that orders are integer-sized with a fixed upper bound R, namely 

Assumption. Each $\nu_i$, $i \in E$, is a discrete bounded distribution on $\mathbb{Z}_+$, so that $Y_k \in \{1, 2, \ldots, R\}$.

The controller cannot observe M; neither does she directly see X. Instead, she receives the 
censored order flow W. The order flow W consists of filled order amounts $(Z_\ell)$, which informally
correspond to the minimum between actual order size and available inventory. Hence, if total 
inventory is insufficient, a stock-out occurs and the excess order size is left indeterminate. The 
model where order flow is always observed will also be considered as a special case with zero 
censoring. The precise description of W will be given in Section 2.1. 

Let $P_t$ be the current inventory at time t. We assume that inventory has finite stock capacity, so
that $P_t \in [0, \bar{P}]$. Inventory changes are driven by two variables: filled customer orders described by
W and supply orders. Customer orders are assumed to be exogenous; every order is immediately 
filled to its maximum extent. When a stock-out occurs, we consider two scenarios: 

• If there is no censoring, then the excess order amount is immediately back-ordered at a 
higher penalty rate. 

• If there is censoring, a lost opportunity cost is assessed in proportion to expected excess 
order amount. 

Otherwise, the inventory is immediately decreased by the order amount. Supply orders are completely
at the discretion of the manager. Let $\xi_1, \xi_2, \ldots$ denote the supply amounts (without back-orders)
and $\tau_1, \tau_2, \ldots$ the supply times (when supply order is placed).

Figure 1. The environment process M modulates the marked point process
$X = ((\sigma_1, Y_1), \ldots)$ representing demand order flow. In this illustration, M takes the values
$M_t \in \{1, 2, 3\} = E$, and the mark distributions are arranged such that $\mu_1 < \mu_2 < \mu_3$
where $\mu_i = \mathbb{E}[Y \mid M_t = i]$. Also, the intensities are arranged $\lambda_3 < \lambda_2 < \lambda_1$, with
$\lambda_i^{-1} = \mathbb{E}[\sigma_1 \mid M_u = i,\ 0 \le u \le T]$. Thus, in regime 1 demand is frequent and demand
amounts are small, in regime 2 demand is average and amounts are small to average,
and in regime 3 demand is rare but consists of relatively large orders.



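To summarize, the dynamics of P are (compare to (2.12) below):

(1.3)
$$
\begin{cases}
P_{\sigma_\ell} = P_{\sigma_\ell-} - Z_\ell, & \text{fill customer order,} \\
P_{\tau_k} = P_{\tau_k-} + \xi_k, & \text{new supply order,} \\
dP_t = 0, & \text{otherwise.}
\end{cases}
$$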



We assume that the manager can only increase inventory (no disposal is possible) and that
inventory never perishes. Alternatives can be straightforwardly dealt with and are considered in
a numerical example in Section 3.3.



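We denote the entire inventory strategy by the right-continuous piecewise constant process
$\xi : [0, T] \times \Omega \to \mathcal{A}$, with $\xi_t = \xi_k$ if $\tau_k \le t < \tau_{k+1}$, or

(1.4) $\xi_t = \sum_{k:\,\tau_k \le T} \xi_k \cdot 1_{[\tau_k, \tau_{k+1})}(t)$.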

Parameter    Meaning
$c(a)$       storage cost for a items per unit time
$K(a)$       instantaneous stock-out cost for a shortage of a items
$h$          order cost for one item
$\zeta$      fixed cost of placing an order
$\rho$       discount factor for NPV calculation
$\bar{P}$    maximum inventory size
$R$          maximum demand size

Table 1. Parameters of the model



The goal of the inventory manager is to minimize inventory costs as appearing in the objective
function (1.5) below among all admissible inventory strategies $\xi$. The admissibility condition
concerns first and foremost the information set available to the manager. Because only W is
observable, the strategy $\xi$ should be determined by the information generated by W, namely each
$\tau_k$ must be a stopping time of the filtration $\mathbb{F}^W = \{\mathcal{F}^W_t\}$ of W. Similarly, the value of each $\xi_k$ is
determined by the information $\mathcal{F}^W_{\tau_k}$ revealed by W until $\tau_k$. Also, without loss of generality we
assume that $\xi$ has a finite number of actions, so that $\mathbb{P}(\tau_k < T) \to 0$ as $k \to \infty$. Since strategies
with infinitely many actions will have infinite costs, we can safely exclude them from consideration.
We denote by $\mathcal{U}(T)$ the set of all such admissible strategies on a time interval [0, T].

The selected policy $\xi$ directly influences current stocks P; when we wish to emphasize this
dependence we will write $P_t \equiv P^{\xi}_t$. The cost of implementing $\xi$ is as follows. First, inventory storage
costs at fixed rate $c(P_t)$ are assessed. We assume that c is positive, increasing and continuous.
Second, a supply order of size $\xi$ costs $h \cdot \xi + \zeta$, for positive h and $\zeta$. Finally, if a stock-out
occurs due to insufficient existing stock, a penalty reflecting the lost opportunity cost is assessed
at amount $\mathbb{E}\big[K((Y_k - P_{\sigma_k-})^+) \mid \mathcal{F}^W_{\sigma_k}\big]$. We assume that the penalty function K is positive and
increasing with K(0) = 0. Thus, the total performance of a strategy $\xi$ on the horizon [0, T] is

(1.5) $\int_0^T e^{-\rho t} c(P_t)\,dt + \sum_{k:\,\tau_k \le T} e^{-\rho \tau_k}\,(h \cdot \xi_k + \zeta) + \sum_{\ell:\,\sigma_\ell \le T} e^{-\rho \sigma_\ell}\, K((Y_\ell - P_{\sigma_\ell-})^+)$,

where $\rho > 0$ is the discount factor for future revenue. Table 1 summarizes our notation and the
meaning of the model parameters.
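
To make the three cost components of (1.5) concrete, the following sketch evaluates the discounted cost along one fully known path. It is a minimal illustration under assumptions that are ours: the function name `discounted_cost`, the event encoding, and the lost-sales convention (inventory drops to zero on a stock-out) are hypothetical, and $\rho > 0$ is required. The storage integral is computed in closed form because inventory is constant between events.

```python
import math


def discounted_cost(p0, events, T, rho, c, K, h, zeta):
    """Pathwise cost in the spirit of (1.5).

    events : time-sorted list of ("demand", t, y) or ("supply", t, xi), t <= T
    c, K   : storage-rate and stock-out penalty functions
    h, zeta: per-item and fixed ordering costs; rho > 0 is the discount rate
    """
    total, level, last = 0.0, p0, 0.0
    for kind, t, amount in events:
        # Storage cost on [last, t): integral of exp(-rho*s) * c(level) ds.
        total += c(level) * (math.exp(-rho * last) - math.exp(-rho * t)) / rho
        if kind == "demand":
            # Stock-out penalty on the unfilled part of the order.
            total += math.exp(-rho * t) * K(max(amount - level, 0))
            level = max(level - amount, 0)
        else:
            # Supply order: proportional cost plus fixed cost.
            total += math.exp(-rho * t) * (h * amount + zeta)
            level += amount
        last = t
    # Storage cost from the last event until the horizon T.
    total += c(level) * (math.exp(-rho * last) - math.exp(-rho * T)) / rho
    return total
```

Averaging this quantity over simulated paths would give a Monte Carlo estimate of the expected cost of a fixed strategy, although the paper itself works with the filtered formulation introduced in Section 2.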



Observe that the objective in (1.5) involves the distribution of the $Z_k$'s (which affect P) and the $Y_k$'s
(which enter into the objective function through stock-out costs). Both of these depend on the state
M. Since the core process M is unobserved, the controller must therefore carry out a filtering
procedure. We postulate that she collects information about M via a Bayesian framework. Let
$\vec{\pi} = (\pi_1, \ldots, \pi_m) = (\mathbb{P}\{M_0 = 1\}, \ldots, \mathbb{P}\{M_0 = m\})$ be the initial (prior) beliefs of the controller
about M and $\mathbb{P}^{\vec{\pi}}$ the corresponding conditional probability law. The controller starts with beliefs
$\vec{\pi}$, observes Z, updates her beliefs and adjusts her inventory policy accordingly.



These notions and the precise updating mechanism will be formalized in Section 2.4. The
solution will then proceed in two steps: an initial filtering step and a second optimization step.
The inference step is studied in Section 2 where we introduce the a posteriori probability process
$\vec{\Pi}$. The process $\vec{\Pi}$ summarizes the dynamic updating of the controller's beliefs about the Markov chain
M given her point process observations. The optimal switching problem (2.5) is then analyzed in
Section 3.



2. Problem Statement 

In this section we give a mathematically precise description of the problem and show that this
problem is equivalent to a fully observed impulse control problem in terms of the strong Markov
process $(\vec{\Pi}, P)$. We will also describe the dynamics of the sufficient statistics $(\vec{\Pi}, P)$.

2.1. Core Process. Let $(\Omega, \mathcal{H}, \mathbb{P})$ be a probability space hosting two independent elements: (i)
a continuous time Markov process M taking values in a finite set E, and with infinitesimal generator
$Q = (q_{ij})_{i,j \in E}$, (ii) $X^{(1)}, \ldots, X^{(m)}$ which are independent compound Poisson processes with
intensities and discrete jump size distributions $(\lambda_1, \nu_1), \ldots, (\lambda_m, \nu_m)$, respectively, $m \in E$.

The core point process X is given by

(2.1) $X_t \triangleq X_0 + \int_0^t \sum_{i \in E} 1_{\{M_s = i\}}\, dX^{(i)}_s, \quad t \ge 0.$

Thus, X is a Markov-modulated Poisson process (see e.g. [23]); by construction, X has independent
increments conditioned on $M = \{M_t\}_{t \ge 0}$. Denote by $\sigma_0, \sigma_1, \ldots$ the arrival times of the
process X,

$\sigma_\ell = \inf\{t > \sigma_{\ell-1} : X_t \ne X_{t-}\}, \quad \ell \ge 1, \quad \text{with } \sigma_0 = 0,$

and by $Y_1, Y_2, \ldots$ the R-valued marks (demand sizes) observed at these arrival times:

$Y_\ell = X_{\sigma_\ell} - X_{\sigma_\ell-}, \quad \ell \ge 1.$

Then conditioned on $\{M_{\sigma_\ell} = i\}$, the distribution of $Y_\ell$ is described by $\nu_i(dy) = f_i(y)\,dy$ on
$(\mathbb{Z}_+, \mathcal{B}(\mathbb{Z}_+))$.

2.2. Observation Process. Starting with the marked point process (M, X), the observable is a
point process W which is derived from (M, X). This means that the marks of W are completely
determined by (M, X) (and the control). Fix an initial stock $P_0 = a$. We first construct an
auxiliary process $W^{(1)}$. The first mark of $W^{(1)}$ is $(\sigma_1, Z_1)$, where $\sigma_1$ is the first arrival time of X
and where the distribution of the first jump size $Z_1 = (Z_1^1, Z_1^2) \in \mathcal{Z}$ conditional on (M, X) is given
by

$\mathbb{P}\{Z_1 = z \mid Y_1 = y,\ M_{\sigma_1} = i\} = C(y, z; P^{(1)}_{\sigma_1-}),$

where $C(y, z; a)$, $y \in \mathbb{Z}_+$, $z \in \mathcal{Z}$, $a \in [0, \bar{P}]$, are the {0, 1}-valued censoring functions
satisfying $\sum_{z \in \mathcal{Z}} C(y, z; a) \in \{0, 1\}$. In our context, it is convenient to take $\mathcal{Z} = \mathbb{Z}_+ \times \{0, \Delta\}$,
where $z = (z^1, z^2) \in \mathcal{Z}$ represents a filled order of size $z^1$, and the second component $z^2$
indicates whether a stock-out occurred or not. Censoring of excess orders then corresponds
to

$C(y, z; p) = 1_{\{(0, p]\}}(y)\, 1_{\{(y, 0)\}}(z) + 1_{\{(p, R]\}}(y)\, 1_{\{(\lfloor p \rfloor, \Delta)\}}(z).$

Alternatively, without censoring we take $\mathcal{Z} = \mathbb{Z}_+ \times \mathbb{Z}_+$ (second component now indicating
actual order size), and

$C(y, z; p) = 1_{\{(0, p]\}}(y)\, 1_{\{(y, y)\}}(z) + 1_{\{(p, R]\}}(y)\, 1_{\{(\lfloor p \rfloor, y)\}}(z).$
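
The two censoring rules can be written out as a small Python sketch. The names `observed_mark`, `C` and the string label `STOCKOUT` (standing in for the stock-out flag) are hypothetical choices for illustration only; the logic follows the indicator formulas for an order of size y arriving when the stock is p.

```python
import math

STOCKOUT = "Delta"  # hypothetical label for the stock-out flag


def observed_mark(y, p, censored=True):
    """Return the unique mark z = (z1, z2) with C(y, z; p) = 1."""
    if y <= p:
        # Order filled in full: flag 0 under censoring, true size otherwise.
        return (y, 0) if censored else (y, y)
    filled = math.floor(p)  # everything in stock is shipped
    # Stock-out: the excess is hidden under censoring, revealed otherwise.
    return (filled, STOCKOUT) if censored else (filled, y)


def C(y, z, p, censored=True):
    """{0, 1}-valued censoring function."""
    return 1 if z == observed_mark(y, p, censored) else 0
```

With stock 3.5, orders of size 4 and 5 both produce the mark `(3, "Delta")` under censoring, which is exactly why the controller cannot tell them apart.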

Once $(\sigma_1, Z_1)$ is observed, we update $P^{(1)}_{\sigma_1} = a - Z_1^1 \ge 0$, and set $P^{(1)}_t = P^{(1)}_{\sigma_1}$ for $\sigma_1 \le t < \sigma_2$,
where $\sigma_2$ is the second arrival time of X. Proceeding as before, we will obtain the marked point
process $W^{(1)} = (\sigma_\ell, Z_\ell)$ and the corresponding uncontrolled stock process $P^{(1)}$.

We now introduce the first impulse control. Let $\{\mathcal{F}^{W,(1)}_t\}$ be the filtration generated by $W^{(1)}$
and let $(\tau_1, \xi_1)$ be an $\mathbb{F}^{W,(1)}$-stopping time, and an $\mathcal{F}^{W,(1)}_{\tau_1}$-measurable $\mathbb{Z}_+$-valued random variable,
respectively, satisfying $\tau_1 \le T$, $0 \le \xi_1 \le \bar{P} - P^{(1)}_{\tau_1}$. The impulse control means that we take
$W = W^{(1)} 1_{[0, \tau_1)}$, $P_t = P^{(1)}_t 1_{[0, \tau_1)}$ and repeat the above construction on the interval $[\tau_1, T]$ starting
with the updated value $P^{(2)}_{\tau_1} = P^{(1)}_{\tau_1} + \xi_1$.

Inductively this provides the auxiliary point processes $W^{(k)}$ together with the impulse controls
$(\tau_k, \xi_k)$, $k = 1, 2, \ldots$. Letting $\xi = (\tau_1, \xi_1, \tau_2, \xi_2, \ldots)$ we finally obtain the $\xi$-controlled inventory
process P, as well as the $\xi$-controlled marked point process $W = \sum_k W^{(k)} 1_{[\tau_{k-1}, \tau_k)}$. By construction,
both P and $\xi$ are $\mathbb{F}^W$-measurable. We denote by $\mathbb{P}^{a,\xi}$ the resulting probability law of (W, P).

Summarizing, P is a piecewise-deterministic controlled process taking values in $[0, \bar{P}]$ and evolving
as in (1.3); the arrival times of W are those of X, and the distribution of its marks $(Z_\ell)$ depends
inductively on the latest $P_{\sigma_\ell-}$, the mark $Y_\ell$ of the core process X, and the censoring functions $C(y, z; \cdot)$.
For further details of the above construction of $\mathbb{P}^{a,\xi}$ we refer the reader to [TT1 pp. 228-230]. Our
use of censoring functions is similar to the construction in [2].

2.3. Statement of the Objective Function in Terms of Sufficient Statistics. Let
$D = \{\vec{\pi} \in [0, 1]^m : \pi_1 + \ldots + \pi_m = 1\}$ be the space of prior distributions of the Markov process M. Also,
let $\mathcal{S}(s) = \{\tau : \mathbb{F}^W\text{-stopping time},\ \tau \le s,\ \mathbb{P}\text{-a.s.}\}$ denote the set of all $\mathbb{F}^W$-stopping times smaller
than or equal to s.

Let $\mathbb{P}^{\vec{\pi},a,\xi}$ denote the probability measure $\mathbb{P}^{a,\xi}$ such that the process M has initial distribution
$\vec{\pi}$. That is,

(2.2) $\mathbb{P}^{\vec{\pi},a,\xi}\{A\} = \pi_1 \mathbb{P}^{a,\xi}\{A \mid M_0 = 1\} + \ldots + \pi_m \mathbb{P}^{a,\xi}\{A \mid M_0 = m\}$

for all $A \in \mathcal{F}^W_T$. $\mathbb{E}^{\vec{\pi},a,\xi}$ can be similarly defined. In the sequel, when $\xi = 0$, we will denote the
corresponding probability measure by $\mathbb{P}^{\vec{\pi},a}$.

We define the D-valued conditional probability process $\vec{\Pi}(t) = (\Pi_1(t), \ldots, \Pi_m(t))$ such that

(2.3) $\Pi_i(t) = \mathbb{P}^{\vec{\pi},a,\xi}\{M_t = i \mid \mathcal{F}^W_t\}$, for $i \in E$, and $t \ge 0$.

Each component of $\vec{\Pi}$ gives the conditional probability that the current state of M is {i} given
the information generated by W until the current time t.



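Using $\vec{\Pi}$ we convert the original objective (1.5) into the following $\mathbb{F}^W$-adapted formulation.
Observe that given $\vec{\Pi}(\sigma_\ell-)$, the distribution of $Y_\ell$ is $\sum_{i \in E} \Pi_i(\sigma_\ell-)\, \nu_i$. Therefore, starting with initial
inventory $a \in [0, \bar{P}]$ and beliefs $\vec{\pi}$, the performance of a given policy $\xi \in \mathcal{U}(T)$ is

(2.4)
$$
J^{\xi}(T, \vec{\pi}, a) \triangleq \mathbb{E}^{\vec{\pi},a,\xi}\Big[ \int_0^T e^{-\rho t} c(P_t)\,dt
+ \sum_{\ell:\,\sigma_\ell \le T} e^{-\rho \sigma_\ell} \sum_{i \in E} \Pi_i(\sigma_\ell-) \int_{\mathbb{R}_+} K((y - P_{\sigma_\ell-})^+)\, \nu_i(dy)
+ \sum_{k \in \mathbb{N}_+} e^{-\rho \tau_k}\,(h \cdot \xi_k + \zeta) \Big].
$$

The first argument in $J^{\xi}$ is the remaining time to maturity. Also, (2.4) assumes that the terminal
salvage value is zero, so at T the remaining inventory is completely forfeited. The inventory
optimization problem is to compute

(2.5) $U(T, \vec{\pi}, a) \triangleq \inf_{\xi \in \mathcal{U}(T)} J^{\xi}(T, \vec{\pi}, a),$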

and, if it exists, find an admissible strategy $\xi^*$ attaining this value. Without loss of generality we
will restrict the set of admissible strategies to those satisfying

(2.6) $\mathbb{E}^{\vec{\pi},a,\xi}\Big[\sum_{k} e^{-\rho \tau_k}\Big] < \infty,$

otherwise infinite costs would be incurred. Note that the admissible strategies have finitely many
switches almost surely for any given path. The equivalence between the "separated" value function
in (2.5) and the original setting of (1.5) is standard in Markovian impulse control problems, see



e.g. irj. 

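The following notation will be used in the subsequent analysis:

(2.7) $I(t) \triangleq \int_0^t \sum_{i \in E} \lambda_i 1_{\{M_s = i\}}\, ds,$

and

(2.8) $\bar{\lambda} \triangleq \max_{i \in E} \lambda_i.$

It is worth noting that the probability of no events for the next u time units is
$\mathbb{P}^{\vec{\pi}}\{\sigma_1 > u\} = \mathbb{E}^{\vec{\pi}}[e^{-I(u)}]$.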

2.4. Sample paths of $(\vec{\Pi}, P)$. In this section we describe the filtering procedure of the controller.
In particular, Proposition 2.1 explicitly shows the evolution of the processes $(\vec{\Pi}, P)$. This is
non-trivial in our model where censoring links the observed demand process with the chosen control
strategy. The description of paths of the conditional probability process when the control does not
alter the observations is discussed in Proposition 2.1 in [26] and Proposition 2.1 of [6]. Filtering
with point process observations has also been studied by [21 EE [201 []]. Also, see [IB] for a general
description of inference in various hidden Markov models in discrete time.

Even though the original order process X has conditionally independent increments, this is no
longer true for the observed requests W since the censoring functions depend on P which in turn
depends on previous marks of W. Nevertheless, given $M_s = i$ for $0 \le s \le \sigma_\ell$, the interarrival
times of W are i.i.d. $\mathrm{Exp}(\lambda_i)$, and the distribution of $Z_\ell$ is only a function of $P_{\sigma_\ell-}$. Therefore, if
we take a sample path of W where r-many arrivals are observed on [0, t], then the likelihood of
this path would be written as

(2.9)
$$
\begin{aligned}
&\mathbb{P}^{\vec{\pi},a,\xi}\{\sigma_k \in dt_k,\ Z_k \in dz_k,\ \sigma_r \le t;\ k \le r \mid M_s = i,\ s \le t\} \\
&\quad = [\lambda_i e^{-\lambda_i t_1} dt_1] \cdots [\lambda_i e^{-\lambda_i (t_r - t_{r-1})} dt_r]\, e^{-\lambda_i (t - t_r)} \prod_{k=1}^{r} \sum_{y} f_i(y)\, C(y, z_k; P_{t_k-}) \\
&\quad = e^{-\lambda_i t} \prod_{k=1}^{r} \lambda_i\, dt_k \cdot g_i(z_k; P_{t_k-}),
\end{aligned}
$$

where

$g_i(z; p) \triangleq \sum_{y \in \{1, \ldots, R\}} f_i(y)\, C(y, z; p)$

denotes the conditional likelihood of a request of type z (which is just the sum of conditional
likelihood of all possible corresponding order sizes y). Note that in the case with censoring
$g_i((z^1, z^2); p) = \sum_{n = \lfloor p \rfloor + 1}^{R} f_i(n)\, 1_{\{z^2 = \Delta\}} + f_i(z^1)\, 1_{\{z^2 = 0\}}$.
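
The censored request likelihood can be sketched directly from its definition. The following is an illustration only: the function name `mark_likelihood`, the dictionary representation of $f_i$, and the Boolean stock-out flag are hypothetical conventions, not the paper's.

```python
import math


def mark_likelihood(f_i, z, p):
    """g_i(z; p) under censoring.

    f_i : dict mapping order size y -> probability of y in regime i
    z   : observed mark (z1, stockout), with stockout a bool
    p   : inventory level just before the arrival
    """
    z1, stockout = z
    if stockout:
        if z1 != math.floor(p):
            return 0.0  # a stock-out always ships the whole available stock
        # Any order size exceeding the stock is consistent with the observation.
        return sum(prob for y, prob in f_i.items() if y > p)
    # Fully filled order: its size is observed exactly.
    return f_i.get(z1, 0.0) if z1 <= p else 0.0
```

Summing `mark_likelihood` over all marks that can occur at a given stock level returns 1, which is a quick sanity check that no order size is lost or double counted by the censoring.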
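More generally, we obtain

(2.10)
$$
\begin{aligned}
&1_{\{M_t = i\}} \cdot \mathbb{P}^{\vec{\pi},a,\xi}\{\sigma_\ell \in dt_\ell,\ Z_\ell = z_\ell,\ \sigma_r \le t;\ \ell \le r \mid M_s,\ s \le t\} \\
&\quad = 1_{\{M_t = i\}} \cdot \exp\Big(-\int_0^t \sum_{j=1}^{m} \lambda_j 1_{\{M_s = j\}}\, ds\Big) \prod_{\ell=1}^{r} \Big( \sum_{j \in E} 1_{\{M_{t_\ell} = j\}} \big[\lambda_j\, dt_\ell \cdot g_j(z_\ell; P_{t_\ell-})\big] \Big).
\end{aligned}
$$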

The above observation leads to the description of the paths of the sufficient statistics $(\vec{\Pi}, P)$.

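Proposition 2.1. Let us define $\vec{x}(t, \vec{\pi}) = (x_1(t, \vec{\pi}), \ldots, x_m(t, \vec{\pi}))$ via

(2.11) $x_i(t, \vec{\pi}) \triangleq \dfrac{\mathbb{P}^{\vec{\pi}}\{\sigma_1 > t,\ M_t = i\}}{\mathbb{P}^{\vec{\pi}}\{\sigma_1 > t\}} = \dfrac{\mathbb{E}^{\vec{\pi}}\big[1_{\{M_t = i\}} \cdot e^{-I(t)}\big]}{\mathbb{E}^{\vec{\pi}}\big[e^{-I(t)}\big]}$, for $i \in E$.

Then the paths of $(\vec{\Pi}, P)$ can be described by

(2.12)
$$
\begin{cases}
\vec{\Pi}(t) = \vec{x}\big(t - \sigma_\ell,\ \vec{\Pi}(\sigma_\ell)\big), & \sigma_\ell \le t < \sigma_{\ell+1},\ \ell \in \mathbb{N}; \\[4pt]
\Pi_i(\sigma_\ell) = \dfrac{\lambda_i f_i(Z^1_\ell)\, \Pi_i(\sigma_\ell-)}{\sum_{j \in E} \lambda_j f_j(Z^1_\ell)\, \Pi_j(\sigma_\ell-)}, & \text{if } Z^2_\ell = 0; \\[8pt]
\Pi_i(\sigma_\ell) = \dfrac{\sum_{y = \lfloor P(\sigma_\ell-) \rfloor + 1}^{R} \lambda_i f_i(y)\, \Pi_i(\sigma_\ell-)}{\sum_{j \in E} \sum_{y = \lfloor P(\sigma_\ell-) \rfloor + 1}^{R} \lambda_j f_j(y)\, \Pi_j(\sigma_\ell-)}, & \text{if } Z^2_\ell = \Delta; \\[8pt]
P(\sigma_\ell) = P(\sigma_\ell-) - Z^1_\ell; \\[2pt]
P(\tau_k) = P(\tau_k-) + \xi_k.
\end{cases}
$$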
Proof. See Section A.2. The main idea is to express $\Pi_i(t)$ as a ratio of likelihood functions, and
to use (2.10) to obtain explicit formulas for likelihood of different observations conditional on the
state of M. □
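
The ratio-of-likelihoods idea at an arrival time amounts to a one-line Bayes update, sketched below. The function name `posterior_after_order` and the list-based inputs are hypothetical; the argument `g` is assumed to hold the request likelihoods $g_i(z; p)$ of the observed mark, already evaluated for each regime.

```python
def posterior_after_order(pi, lam, g):
    """Belief vector right after an arrival.

    pi  : belief vector just before the arrival
    lam : lam[i] is the arrival intensity in regime i
    g   : g[i] is the likelihood of the observed mark in regime i
    """
    # Unnormalized posterior weight of regime i: intensity * mark likelihood * prior.
    weights = [l * gi * p for l, gi, p in zip(lam, g, pi)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("observed mark has zero likelihood under every regime")
    return [w / total for w in weights]
```

For example, with equal prior beliefs, intensities (2, 1) and mark likelihoods (0.1, 0.4), the posterior is (1/3, 2/3): the higher arrival rate of the first regime only partly offsets the fact that the observed mark is four times more likely in the second.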

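The deterministic paths described by $\vec{x}$ come from a first-order ordinary differential equation.
To observe this fact first recall that the components of the vector

(2.13) $\vec{m}(t, \vec{\pi}) = (m_1(t, \vec{\pi}), \ldots, m_m(t, \vec{\pi})) \triangleq \Big( \mathbb{E}^{\vec{\pi},a}\big[1_{\{M_t = 1\}} \cdot e^{-I(t)}\big], \ldots, \mathbb{E}^{\vec{\pi},a}\big[1_{\{M_t = m\}} \cdot e^{-I(t)}\big] \Big)$

solve $dm_i(t, \vec{\pi})/dt = -\lambda_i m_i(t, \vec{\pi}) + \sum_{j \in E} m_j(t, \vec{\pi}) \cdot q_{j,i}$ (see e.g. [El E7J [23]). Now using (2.11) and
applying the chain rule we obtain

(2.14) $\dfrac{dx_i(t, \vec{\pi})}{dt} = \Big( \sum_{j \in E} q_{j,i}\, x_j(t, \vec{\pi}) - \lambda_i x_i(t, \vec{\pi}) + x_i(t, \vec{\pi}) \sum_{j \in E} \lambda_j x_j(t, \vec{\pi}) \Big).$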
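
Between arrivals the beliefs follow the deterministic flow of (2.14), which can be integrated numerically. The sketch below uses a classical fourth-order Runge-Kutta step; this integrator, the function names `belief_drift` and `flow`, and the 0-based regime indexing are our assumptions for illustration, not the computational scheme of the paper.

```python
def belief_drift(x, Q, lam):
    """Right-hand side of (2.14) for the belief vector x."""
    m = len(x)
    mean_rate = sum(lam[j] * x[j] for j in range(m))
    return [
        sum(Q[j][i] * x[j] for j in range(m)) - lam[i] * x[i] + x[i] * mean_rate
        for i in range(m)
    ]


def flow(pi, Q, lam, t, steps=1000):
    """Approximate x(t, pi) by integrating (2.14) with classical RK4."""
    x, h = list(pi), t / steps
    shift = lambda v, k, c: [a + c * b for a, b in zip(v, k)]
    for _ in range(steps):
        k1 = belief_drift(x, Q, lam)
        k2 = belief_drift(shift(x, k1, h / 2), Q, lam)
        k3 = belief_drift(shift(x, k2, h / 2), Q, lam)
        k4 = belief_drift(shift(x, k3, h), Q, lam)
        x = [a + h * (b + 2 * c + 2 * d + e) / 6
             for a, b, c, d, e in zip(x, k1, k2, k3, k4)]
    return x
```

When the generator is zero (no regime switching), (2.11) gives the closed form $x_i(t) = \pi_i e^{-\lambda_i t} / \sum_j \pi_j e^{-\lambda_j t}$, which provides a convenient check of the integrator: a quiet period shifts belief toward the low-intensity regimes.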



Note that since $(\vec{\Pi}, P)$ is a piecewise deterministic Markov process by the last proposition, the
results of [T7] imply that this pair is a strong Markov process. 

3. Characterization and Continuity of the Value Function 

A standard approach (see e.g. [12]) to solving stochastic control problems makes use of the
dynamic programming (DP) principle. Heuristically, the DP implies that to implement an optimal
policy, the manager should continuously compare the intervention value, i.e. the maximum value
that can be extracted by immediately ordering the most beneficial amount of inventory, with the
continuation value, i.e. maximum value that can be extracted by doing nothing for the time being.
In continuous time this leads to a recursive equation that couples $U(t, \vec{\pi}, a)$ with $U(t - dt, \vec{\pi}, b)$ for
$b \ge a$. Such a coupled equation could then be solved inductively starting with the known value of
$U(0, \vec{\pi}, a)$.

In this section we show that the above intuition is correct and that U satisfies a coupled optimal
stopping problem. More precisely, we show in Theorem 3.1 that it is the unique fixed point of
a functional operator $\mathcal{G}$ that maps functions to value functions of optimal stopping problems.
This gives a direct and self-contained proof of the DP for our problem. We also show that the
sequence of functions that we obtain by iterating $\mathcal{G}$ starting at the value of no-action converges to
the value function uniformly. Since $\mathcal{G}$ maps continuous functions to continuous functions and the
convergence is uniform, we obtain that the value function U is jointly continuous with respect to all
of its variables. Continuity of the value function leads to a direct characterization of an optimal
strategy in Proposition 3.3. The analysis of this section parallels the general (infinite horizon)
framework of impulse control of piecewise deterministic processes (pdp) developed by [15]. We
should also point out that Theorem 3.1 is used to establish an alternative characterization of the
value function, see Proposition 3.4, which is more amenable to computing the value function.

First, we will analyze the problem with no intervention. This analysis will facilitate the proofs 
of the main results in the subsequent subsection. 

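3.1. Analysis of the Problem with no Intervention. Let $U_0$ be the value of no-action, i.e.,

(3.1)
$$
\begin{aligned}
U_0(T, \vec{\pi}, a) &= \mathbb{E}^{\vec{\pi},a}\Big[ \int_0^T e^{-\rho t} c(P_t)\,dt + \sum_{k:\,\sigma_k \le T} e^{-\rho \sigma_k} K((Y_k - P_{\sigma_k-})^+) \Big] \\
&= \mathbb{E}^{\vec{\pi},a}\Big[ \int_0^T e^{-\rho t} c(P_t)\,dt + \sum_{k:\,\sigma_k \le T} e^{-\rho \sigma_k} \sum_{i \in E} \Pi_i(\sigma_k-) \int_{\mathbb{R}_+} K((y - P_{\sigma_k-})^+)\, \nu_i(dy) \Big].
\end{aligned}
$$

We will prove the continuity of $U_0$ in the next proposition; this property will become crucial in the
proof of the main result of the next section. But first let us present an auxiliary lemma which
will help us prove this proposition.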

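Lemma 3.1. For all $n \ge 2$, we have the uniform bound

(3.2) $\mathbb{P}^{\vec{\pi},a}\{T > \sigma_n\} \le \dfrac{\bar{\lambda} T}{n - 1}.$

Proof. Step 1. First we will show that

(3.3) $\mathbb{E}^{\vec{\pi},a}\big[e^{-u \sigma_n}\big] \le \Big( \dfrac{\bar{\lambda}}{\bar{\lambda} + u} \Big)^n.$

The conditional probability of the first jump satisfies $\mathbb{P}^{\vec{\pi}}\{\sigma_1 > t \mid M\} = e^{-I(t)}$. Therefore,

(3.4)
$$
\begin{aligned}
\mathbb{E}^{\vec{\pi},a}\big[e^{-u \sigma_1} \mid M\big] &= \mathbb{E}^{\vec{\pi},a}\Big[ \int_{\sigma_1}^{\infty} u e^{-ut}\,dt \,\Big|\, M \Big]
= \int_0^{\infty} \mathbb{P}^{\vec{\pi}}\{\sigma_1 \le t \mid M\}\, u e^{-ut}\,dt \\
&= \int_0^{\infty} \big[1 - e^{-I(t)}\big]\, u e^{-ut}\,dt
\le \int_0^{\infty} \big[1 - e^{-\bar{\lambda} t}\big]\, u e^{-ut}\,dt = \frac{\bar{\lambda}}{u + \bar{\lambda}}.
\end{aligned}
$$

Since the observed process X has independent increments given M, it readily follows that
$\mathbb{E}^{\vec{\pi},a}\big[e^{-u \sigma_n} \mid M\big] \le \bar{\lambda}^n / (\bar{\lambda} + u)^n$, which immediately implies (3.3).

Step 2. Note that

(3.5) $\mathbb{P}^{\vec{\pi},a}\{T > \sigma_n\} \le \mathbb{E}^{\vec{\pi},a}\big[1_{\{T > \sigma_n\}} (T / \sigma_n)\big] \le \mathbb{E}^{\vec{\pi},a}\big[T / \sigma_n\big].$

Since $1/\sigma_n = \int_0^{\infty} e^{-\sigma_n u}\,du$, an application of Fubini's theorem together with (3.3) gives

(3.6) $\mathbb{E}^{\vec{\pi},a}\Big[\dfrac{1}{\sigma_n}\Big] \le \int_0^{\infty} \Big( \dfrac{\bar{\lambda}}{\bar{\lambda} + u} \Big)^n du = \dfrac{\bar{\lambda}}{n - 1}, \quad n \ge 2,$

which implies the result. □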
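
For completeness, the integral appearing in the last step of the proof can be evaluated explicitly; the substitution $v = \bar{\lambda} + u$ below is our own worked step and also shows why the bound requires $n \ge 2$.

```latex
\int_0^{\infty} \Big( \frac{\bar{\lambda}}{\bar{\lambda} + u} \Big)^{n} du
  = \bar{\lambda}^{\,n} \int_{\bar{\lambda}}^{\infty} v^{-n} \, dv
  = \bar{\lambda}^{\,n} \cdot \frac{\bar{\lambda}^{\,1-n}}{n-1}
  = \frac{\bar{\lambda}}{n-1}, \qquad n \ge 2,
% for n = 1 the integrand decays like 1/v and the integral diverges.
\qquad\text{so that}\qquad
\mathbb{P}^{\vec{\pi},a}\{T > \sigma_n\}
  \le T \, \mathbb{E}^{\vec{\pi},a}\Big[ \frac{1}{\sigma_n} \Big]
  \le \frac{\bar{\lambda} T}{n-1}.
```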



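Define the jump operators $S_i$ through their action on a test function w by

(3.7) $S_i w(t, \vec{\pi}, a) \triangleq \int_{\mathbb{R}_+} \Big\{ w\Big(t, \Big( \dfrac{\lambda_j f_j(y)\, \pi_j}{\sum_{k \in E} \lambda_k f_k(y)\, \pi_k} \Big)_{j \in E}, (a - y)^+ \Big) + K((y - a)^+) \Big\}\, \nu_i(dy)$, for $i \in E$.

The motivation for $S_i$ comes from the dynamics of $\vec{\Pi}$ in Proposition 2.1 and studying expected
costs if an immediate demand order (of size y) arrives.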

Proposition 3.1. $U_0$ is a continuous function.

Proof. Let us define a functional operator $\Gamma$ through its action on a test function w by

(3.8)
$$
\begin{aligned}
\Gamma w(T, \vec{\pi}, a) &= \mathbb{E}^{\vec{\pi},a}\Big[ \int_0^{\sigma_1 \wedge T} e^{-\rho t} c(p(t, a))\,dt \\
&\qquad + 1_{\{\sigma_1 \le T\}}\, e^{-\rho \sigma_1} \Big( \sum_{i \in E} \Pi_i(\sigma_1-) \int_{\mathbb{R}_+} K((y - P_{\sigma_1-})^+)\, \nu_i(dy) + w(T - \sigma_1, \vec{\Pi}_{\sigma_1}, P_{\sigma_1}) \Big) \Big] \\
&= \int_0^T e^{-\rho u} \sum_{i \in E} m_i(u, \vec{\pi}) \cdot \big[ c(p(u, a)) + \lambda_i \cdot S_i w(T - u, \vec{x}(u, \vec{\pi}), p(u, a)) \big]\,du.
\end{aligned}
$$

The operator $\Gamma$ is motivated by studying expected costs up to and including the first demand
order assuming no-action on the part of the manager.

It is clear from the last line of (3.8) that $\Gamma$ maps continuous functions to continuous functions.
As a result of the strong Markov property of $(\vec{\Pi}, P)$ we observe that $U_0$ is a fixed point of $\Gamma$, and
if we define

(3.9) $k_{n+1}(T, \vec{\pi}, a) = \Gamma k_n(T, \vec{\pi}, a), \quad k_0(T, \vec{\pi}, a) = 0, \quad T \in \mathbb{R}_+,\ \vec{\pi} \in D,\ a \in \mathcal{A},$

then $k_n \nearrow U_0$ (also see Proposition 1 in [15]). To complete our proof we will show that $k_n$ converges
to $U_0$ uniformly (locally in the T variable); since all the elements of the sequence $(k_n)_{n \in \mathbb{N}}$ are
continuous the result then follows.
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Using the strong Markov property we can write k n as 



cr„AT 



(3.10) k n (T, 7?, a) = E 7r ' a 

from which it follows that 

\U {T,Tf,a)-k n {T,7f,a)\ < E*< 



e- pt c(P t ) dt + \ ■-' s " 1] 



e~^K((Y k - P CTfc _ 



!{T> CT „} ( [ T e- pt c(P)dt + K(R)J2e 



(3.1i; 



< E*'° [1 {T>CT „ } (c(P)T + K(R)(N(T) - iV(a n 

< P ^{ T > a n }c(P)T + K(R)E*> a [l {T>an} N(T 



< p-' a {T > a n }c(P)T + i^iZWP^T > a n }WE-> a [iV(T) 2 ], 



where we used the Cauchy-Schwarz inequality to obtain the last inequality. Using Lemma |3.1 
and E^ a [N(T) 2 } < AT + (AT) 2 we obtain 



(3.12) 



\U (T, 7T, a) - k n (T } 7?,a)\< C (P)T-^— + K{R)\/ XT + (AT) 2 

n — 1 



< c(P)T^- + K(R) y XT + A V 
n — 1 



XT 
n — 1 

AT 
n — 1 



for any T G [0, T]. Letting n — >■ 00 we see that (A; n ) n eN converges to Uq uniformly on [0, T}. Since 
T is arbitrary, the result follows. □ 
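As a quick numerical illustration of the two estimates used in this proof, the following sketch checks the closed form $\int_0^\infty (\bar\lambda/(\bar\lambda+u))^n du = \bar\lambda/(n-1)$ and evaluates the right-hand side of (3.12). The function names and the sample values of $\bar\lambda$, $\bar T$, $c(\bar P)$ and $K(R)$ are hypothetical and chosen only for illustration.

```python
import math


def tail_integral(n, lam, steps=20_000):
    """Midpoint-rule value of int_0^inf (lam / (lam + u))**n du for n >= 2.

    The substitution s = lam / (lam + u) turns the integral into
    lam * int_0^1 s**(n - 2) ds, which is integrated on a uniform grid.
    """
    h = 1.0 / steps
    return lam * h * sum(((j + 0.5) * h) ** (n - 2) for j in range(steps))


def uniform_bound(n, t_bar, lam, c_max, k_max):
    """Right-hand side of (3.12) at T = t_bar (all inputs are illustrative)."""
    p = lam * t_bar / (n - 1)                      # bound on P{T > sigma_n}
    second_moment = lam * t_bar + (lam * t_bar) ** 2  # bound on E[N(T)^2]
    return c_max * t_bar * p + k_max * math.sqrt(second_moment) * math.sqrt(p)


if __name__ == "__main__":
    print(tail_integral(5, 2.0), 2.0 / (5 - 1))
    for n in (11, 101, 1001):
        print(n, uniform_bound(n, 5.0, 1.0, 18.0, 36.0))
```

The bound decays like $n^{-1/2}$, which is what drives the uniform convergence of $k_n$ on $[0,\bar T]$.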

3.2. Dynamic Programming Principle and an Optimal Control. We are now in a position to establish the dynamic programming principle (DP) for $U$ and to characterize an optimal strategy. In our problem the DP takes the form of a coupled optimal stopping problem; see Theorem 3.1.

Let us introduce a functional operator $\mathcal{M}$ whose action on a test function $w$ is

$$\mathcal{M}w(T,\vec\pi,a) = \min_{b:\,a\le b\le \bar P}\big\{w(T,\vec\pi,b) + h\,(b-a) + \zeta\big\}. \tag{3.13}$$

The operator $\mathcal{M}$ is called the intervention operator and denotes the minimum cost that can be achieved if an immediate supply of size $b-a$ is made.

Lemma 3.2. The operator $\mathcal{M}$ maps continuous functions to continuous functions.

Proof. This result follows since the set-valued map $a \mapsto [a,\bar P]$ is continuous (see Proposition D.3 of [22]). □
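On a discrete inventory grid the intervention operator reduces to a finite minimization. The sketch below is only an illustration under that assumption: `w` is a hypothetical list of values of the test function at inventory levels $0,1,\dots,\bar P$ for one fixed pair $(T,\vec\pi)$, and the function name is not from the paper. It also returns the smallest inventory level attaining the minimum.

```python
def intervention(w, a, h, zeta):
    """Evaluate the minimization in (3.13) at inventory level a.

    w[b] is the value of the test function at inventory level b for a fixed
    (T, pi); the capacity is len(w) - 1. Returns (Mw, d), where d is the
    smallest level b in [a, capacity] attaining the minimum.
    """
    costs = [w[b] + h * (b - a) + zeta for b in range(a, len(w))]
    best = min(costs)
    d = a + costs.index(best)  # list.index returns the first (smallest) minimizer
    return best, d


if __name__ == "__main__":
    w = [10.0, 6.0, 5.0, 7.0]  # illustrative values only
    print(intervention(w, 0, h=1.0, zeta=1.0))
    print(intervention(w, 2, h=1.0, zeta=1.0))
```

With these illustrative numbers, levels 1 and 2 tie when starting from an empty inventory, and the smaller level is selected.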



We will denote the smallest supply order level attaining the minimum in (3.13) by

$$d_{\mathcal{M}w}(T,\vec\pi,a) = \min\big\{b\in[a,\bar P] : w(T,\vec\pi,b) + h\,(b-a) + \zeta = \mathcal{M}w(T,\vec\pi,a)\big\}. \tag{3.14}$$

Let us define a functional operator $\mathcal{G}$ by its action on a test function $V$ as

$$\mathcal{G}V(T,\vec\pi,a) = \inf_{\tau\in\mathcal{S}(T)} \mathbb{E}^{\vec\pi,a}\Big[\int_0^{\tau} e^{-\rho s}c(P_s)\,ds + \sum_{k} e^{-\rho\sigma_k}1_{\{\sigma_k\le\tau\}}K\big((Y_k-P_{\sigma_k-})^+\big) + e^{-\rho\tau}\,\mathcal{M}V\big(T-\tau,\vec\Pi_\tau,P_\tau\big)\Big] \tag{3.15}$$

for $T\in\mathbb{R}_+$, $\vec\pi\in D$, and $a\in\mathcal{A}$. The above definition is motivated by studying the minimal expected costs incurred by the manager until the first supply order time $\tau$.

Lemma 3.3. The operator $\mathcal{G}$ maps continuous functions to continuous functions.

Proof. This follows as a result of Lemma 3.2, as shown in Corollary 3.1 of [26] (see also Remark 3.4 in [?]): when $\mathcal{M}w$ is continuous, the value function $\mathcal{G}w$ of this optimal stopping problem is also continuous. □



Let $V_0 = U_0$ (from (3.1)) and

$$V_{n+1} = \mathcal{G}V_n, \qquad n \ge 0. \tag{3.16}$$

Clearly, since $\mathcal{G}$ is a monotone/positive operator, i.e. for any two functions $f_1 \le f_2$ we have $\mathcal{G}f_1 \le \mathcal{G}f_2$, and since $V_1 \le V_0$, $(V_n)_{n\in\mathbb{N}}$ is a decreasing sequence of functions. The next two results show that this sequence converges (pointwise) to the value function, and that the value function satisfies the dynamic programming principle. Similar results were presented in Propositions 3.2 and 3.3 in [5] (for a problem in which the controls do not interact with the observations). The proofs of the following results are similar, and hence we give them in the Appendix for the reader's convenience.

Lemma 3.4. $V_n(T,\vec\pi,a) \downarrow U(T,\vec\pi,a)$, for any $T\in\mathbb{R}_+$, $\vec\pi\in D$, $a\in\mathcal{A}$.

Proof. The proof makes use of the fact that the value functions defined by restricting the admissible strategies to the ones with at most $n \ge 1$ supply orders up to time $T$ can be obtained by iterating the operator $\mathcal{G}$ $n$ times (starting from $U_0$). This preliminary result is developed in Appendix B.1. The details of the proof can be found in Appendix B.2. □

Proposition 3.2. The value function $U$ is the largest solution of the dynamic programming equation $\mathcal{G}U = U$ such that $U \le U_0$.

Proof. The result follows from the monotonicity of $\mathcal{G}$ and Lemma 3.4. See Appendix B.3 for the details. □



The theorem below improves the results of Proposition 3.2 and helps us describe an optimal policy. Let us first point out that $U_0$, and hence $U$, are bounded.

Remark 3.1. It can be observed from the proof of Proposition 3.1 that the value function $U(T,\cdot,a)$ is uniformly bounded,

$$\begin{aligned}
0 \le U(T,\vec\pi,a) \le U_0(T,\vec\pi,a) &= \mathbb{E}^{\vec\pi,a}\Big[\int_0^T e^{-\rho s}c(P_s)\,ds + \sum_k e^{-\rho\sigma_k}1_{\{\sigma_k\le T\}}K\big((Y_k-P_{\sigma_k-})^+\big)\Big] \\
&\le \int_0^T e^{-\rho s}c(\bar P)\,ds + \mathbb{E}^{\vec\pi,a}\Big[\sum_k 1_{\{\sigma_k\le T\}}K(R)\Big] \\
&\le c(\bar P)\,T + K(R)\,\mathbb{E}^{\vec\pi,a}[N(T)] \le \big[c(\bar P) + K(R)\,\bar\lambda\big]\cdot T,
\end{aligned}$$

since $N$ is a counting process with maximum intensity $\bar\lambda$.

Below is the main result of this section. 

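Theorem 3.1. The value function $U$ is the unique fixed point of $\mathcal{G}$ and it is continuous.

Proof. Step 1. Let us fix $\bar T > 0$. We will first show that $U$ is the unique fixed point of $\mathcal{G}$ and that $(V_n)_{n\in\mathbb{N}}$ converges to $U$ uniformly on $T\in[0,\bar T]$, $\vec\pi\in D$, $a\in\mathcal{A}$. Let us restrict our functions $U_0$ and $U$ to $T\in[0,\bar T]$, $\vec\pi\in D$, $a\in\mathcal{A}$, and consider the restriction of $\mathcal{G}$ that acts on functions defined on this domain. Thanks to Lemma 1 of [21] (also see [31]) it is enough to show that $U_0 \le \kappa\,\mathcal{G}0$ for some $\kappa(\bar T) > 0$. (We showed that $U_0$ is continuous in Proposition 3.1 and that $U_0$ is bounded on $[0,\bar T]$ in Remark 3.1 in order to apply this lemma.) For any stopping time $\tau \le T$,

$$\begin{aligned}
U_0(T,\vec\pi,a) = \mathbb{E}^{\vec\pi,a}\Big[&\int_0^{\tau} e^{-\rho t}c(P_t)\,dt + \sum_{k:\,\sigma_k\le\tau} e^{-\rho\sigma_k}K\big((Y_k-P_{\sigma_k-})^+\big) \\
&+ \int_\tau^T e^{-\rho t}c(P_t)\,dt + \sum_{k:\,\sigma_k\in(\tau,T]} e^{-\rho\sigma_k}K\big((Y_k-P_{\sigma_k-})^+\big)\Big].
\end{aligned} \tag{3.17}$$

Next, we will provide upper bounds for the terms in the second line of (3.17). First, note that

$$\int_\tau^T e^{-\rho t}c(P_t)\,dt \le C_\rho(\tau) \triangleq \begin{cases} c(\bar P)\,\dfrac{e^{-\rho\tau}-e^{-\rho T}}{\rho} & \text{if } \rho > 0, \\[2mm] (T-\tau)\,c(\bar P) & \text{if } \rho = 0, \end{cases} \tag{3.18}$$

and that $C_\rho(\tau) \le e^{-\rho\tau}C_\rho(0)$. Second,

$$\sum_{k:\,\sigma_k\in(\tau,T]} e^{-\rho\sigma_k}K\big((Y_k-P_{\sigma_k-})^+\big) \le e^{-\rho\tau}K(R)\sum_k 1_{\{\sigma_k\in(\tau,T]\}}. \tag{3.19}$$

The expected value of the sum on the right-hand side of (3.19) is bounded above by a constant, namely

$$\mathbb{E}^{\vec\pi,a}\Big[\sum_k 1_{\{\sigma_k\in(\tau,T]\}}\Big] = \mathbb{E}^{\vec\pi,a}\big[N(T)-N(\tau)\big] \le \bar\lambda T \le \bar\lambda\bar T. \tag{3.20}$$

Using the estimates developed in (3.18)-(3.20) back in (3.17), we obtain that

$$\begin{aligned}
U_0(T,\vec\pi,a) &\le \mathbb{E}^{\vec\pi,a}\Big[\int_0^{\tau} e^{-\rho t}c(P_t)\,dt + \sum_{k:\,\sigma_k\le\tau} e^{-\rho\sigma_k}K\big((Y_k-P_{\sigma_k-})^+\big) + e^{-\rho\tau}\big(C_\rho(0)+K(R)\,\bar\lambda\bar T\big)\Big] \\
&\le \Big(\frac{C_\rho(0)+K(R)\,\bar\lambda\bar T}{\zeta}\vee 1\Big)\,\mathbb{E}^{\vec\pi,a}\Big[\int_0^{\tau} e^{-\rho t}c(P_t)\,dt + \sum_{k:\,\sigma_k\le\tau} e^{-\rho\sigma_k}K\big((Y_k-P_{\sigma_k-})^+\big) + e^{-\rho\tau}\zeta\Big].
\end{aligned} \tag{3.21}$$

Minimizing the right-hand side over all admissible stopping times $\tau$, and noting that $\mathcal{M}0 \equiv \zeta$, we obtain that

$$U_0(T,\vec\pi,a) \le \Big(\frac{C_\rho(0)+K(R)\,\bar\lambda\bar T}{\zeta}\vee 1\Big)\,\mathcal{G}0(T,\vec\pi,a),$$

which establishes the desired result. Moreover, since $\bar T$ is arbitrary, we see that $U$ is indeed the unique fixed point of $\mathcal{G}$ among all the functions defined on $T\in\mathbb{R}_+$, $\vec\pi\in D$, $a\in\mathcal{A}$.

Step 2. We will show that $U$ is continuous. Since $(V_n)_{n\in\mathbb{N}}$ converges to $U$ uniformly on $T\in[0,\bar T]$, $\vec\pi\in D$, $a\in\mathcal{A}$ for any $\bar T<\infty$, the proof will follow once we show that every element in the sequence $(V_n)_{n\in\mathbb{N}}$ is continuous. But this result follows from Lemma 3.3 and the continuity of $U_0$. □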

Using the continuity of the value function, one can prove that the strategy given in the next 
proposition is optimal. The proof is analogous to the proof of Proposition 4.1 of [5]. 

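Proposition 3.3. Let us iteratively define $\xi^* = (\xi_0,\tau_0;\xi_1,\tau_1;\ldots)$ via $\xi_0 = a$, $\tau_0 = 0$ and

$$\begin{aligned}
\tau_{k+1} &= \inf\big\{s\in[\tau_k,T] : U(T-s,\vec\Pi(s),P_s) = \mathcal{M}U(T-s,\vec\Pi(s),P_s)\big\}; \\
\xi_{k+1} &= d_{\mathcal{M}U}\big(T-\tau_{k+1},\vec\Pi(\tau_{k+1}),P_{\tau_{k+1}}\big), \qquad k = 0,1,\ldots
\end{aligned} \tag{3.22}$$

with the convention that $\inf\emptyset = T+\epsilon$, $\epsilon>0$, and $\xi_{k+1} = 0$. Then $\xi^*$ is an optimal strategy for (2.5).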



Proposition 3.3 implies that to implement an optimal policy the manager should continuously compare the intervention value $\mathcal{M}U$ with the value function $U \le \mathcal{M}U$. As long as $U < \mathcal{M}U$, it is optimal to do nothing; as soon as $U = \mathcal{M}U$, new inventory in the amount prescribed by $d_{\mathcal{M}U}$ should be ordered. The overall structure thus corresponds to a time- and belief-dependent (s, S) strategy, which matches the intuition of real-life inventory managers.

Remark 3.2. As a result of the dynamic programming principle proved in Theorem 3.1, the value function $U$ is also expected to be the unique weak solution of a coupled system of QVIs (quasi-variational inequalities)

$$\begin{aligned}
&-\frac{\partial}{\partial T}U(T,\vec\pi,a) + \mathcal{A}U(T,\vec\pi,a) - \rho\,U(T,\vec\pi,a) + c(\vec\pi,a) \ge 0, \\
&U(T,\vec\pi,a) \le \mathcal{M}U(T,\vec\pi,a), \\
&\Big(-\frac{\partial}{\partial T}U(T,\vec\pi,a) + \mathcal{A}U(T,\vec\pi,a) - \rho\,U(T,\vec\pi,a) + c(\vec\pi,a)\Big)\big(U(T,\vec\pi,a) - \mathcal{M}U(T,\vec\pi,a)\big) = 0.
\end{aligned} \tag{3.23}$$

Here $\mathcal{A}$ is the infinitesimal generator (a first-order integro-differential operator) of the piecewise deterministic Markov process $(\vec\Pi,P)$, whose paths are given by Proposition 2.1. To determine $U$ one could attempt to numerically solve the above multi-dimensional QVI. However, this is a non-trivial task. We will see in Section 3.3 that the value function can be characterized in a way that naturally leads to a numerical implementation. Also, having a weak solution is not good enough for the existence of an optimal control, whereas in Theorem 3.1 we directly established the regularity properties of $U$ which lead to a characterization of an optimal control.

3.3. Computation of the Value Function. The characterization of the value function $U$ as a fixed point of the operator $\mathcal{G}$ is not very amenable to actually computing $U$. Indeed, solving the resulting coupled optimal stopping problems is generally a major challenge. Recall that $U$ is also needed to obtain the optimal policy of Proposition 3.3, which is the main item of interest for a practitioner.

To address these issues, in this subsection we develop another dynamic programming equation that is more suitable for numerical implementation. Namely, Proposition 3.4 provides a representation for $U$ that involves only the operator $L$, which consists of a deterministic optimization over time. This operator can then be easily approximated on a computer using a time- and belief-space discretization. We have implemented such an algorithm, and in Section 4 we use this representation to give two numerical illustrations.

We will show that the value function $U$ satisfies a second dynamic programming principle, namely $U$ is the fixed point of the first jump operator $L$, whose action on a test function $V$ is given by

$$\begin{aligned}
L(V)(T,\vec\pi,a) = \inf_{t\in[0,T]} \mathbb{E}^{\vec\pi,a}\Big[&\int_0^{t\wedge\sigma_1} e^{-\rho s}c(p(s,a))\,ds + 1_{\{t<\sigma_1\}}e^{-\rho t}\,\mathcal{M}V\big(T-t,\vec\Pi_t,P_t\big) \\
&+ e^{-\rho\sigma_1}1_{\{t\ge\sigma_1\}}V\big(T-\sigma_1,\vec\Pi_{\sigma_1},P_{\sigma_1}\big)\Big].
\end{aligned} \tag{3.24}$$

This representation will be used in our numerical computations in Section 4. Observe that the operator $L$ is monotone. Using the characterization of the stopping times of piecewise deterministic Markov processes (Theorem T.33 of [13] and Theorem A2.3 of [11]), which states that for any $\tau\in\mathcal{S}(T)$, $\tau\wedge\sigma_1 = t\wedge\sigma_1$ for some constant $t$, we can write

$$\begin{aligned}
L(V)(T,\vec\pi,a) = \inf_{\tau\in\mathcal{S}(T)} \mathbb{E}^{\vec\pi,a}\Big[&\int_0^{\tau\wedge\sigma_1} e^{-\rho s}c(p(s,a))\,ds + 1_{\{\tau<\sigma_1\}}e^{-\rho\tau}\,\mathcal{M}V\big(T-\tau,\vec\Pi_\tau,P_\tau\big) \\
&+ e^{-\rho\sigma_1}1_{\{\tau\ge\sigma_1\}}V\big(T-\sigma_1,\vec\Pi_{\sigma_1},P_{\sigma_1}\big)\Big].
\end{aligned} \tag{3.25}$$

The following proposition gives the characterization of $U$ that we will use in Section 4. The proof of this result is carried out along the same lines as the proof of Proposition 3.4 of [5]. The main ingredient is Theorem 3.1. We will skip the proof of this result and leave it to the reader as an exercise.

Proposition 3.4. $U$ is the unique fixed point of $L$. Moreover, the following sequence, which is constructed by iterating $L$,

$$W_0 \triangleq U_0, \qquad W_{n+1} \triangleq LW_n, \qquad n\in\mathbb{N}, \tag{3.26}$$

satisfies $W_n \searrow U$ (uniformly).
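The iteration $W_{n+1} = LW_n$ can be organized as a generic successive-approximation loop that stops once the sup-norm change falls below a tolerance. The sketch below is only a schematic: `op` stands in for a discretized version of $L$ acting on a list of grid values, and the toy operator in the usage example is a hypothetical monotone map chosen so that the iterates decrease to a known fixed point; it is not the operator of the paper.

```python
def iterate_operator(op, w0, tol=1e-10, max_iter=10_000):
    """Iterate w <- op(w) until the sup-norm change is below tol.

    op : callable mapping a list of floats to a list of the same length
         (a stand-in for a discretized operator such as L).
    Returns (w, n) where n is the number of iterations performed.
    """
    w = list(w0)
    for n in range(1, max_iter + 1):
        w_next = op(w)
        gap = max(abs(x - y) for x, y in zip(w, w_next))
        w = w_next
        if gap < tol:
            return w, n
    raise RuntimeError("no convergence within max_iter iterations")


if __name__ == "__main__":
    # Toy monotone operator with fixed point 2.0 in every coordinate.
    toy_op = lambda w: [min(x, 0.5 * x + 1.0) for x in w]
    values, n_iter = iterate_operator(toy_op, [10.0, 4.0])
    print(values, n_iter)
```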



Remark 3.3. Using Fubini's theorem, (A.7) and (2.13) we can write $L$ as

$$\begin{aligned}
LV(T,\vec\pi,a) = \inf_{t\in[0,T]}\Big\{&\Big(\sum_{i\in E} m_i(t,\vec\pi)\Big)\cdot e^{-\rho t}\,\mathcal{M}V\big(T-t,\ \vec x(t,\vec\pi),\ p(t,a)\big) \\
&+ \int_0^t e^{-\rho u}\sum_{i\in E} m_i(u,\vec\pi)\cdot\Big(c(p(u,a)) + \lambda_i\cdot S_iV\big(T-u,\ \vec x(u,\vec\pi),\ p(u,a)\big)\Big)\,du\Big\},
\end{aligned} \tag{3.27}$$

in which $S_i$ is given by (3.7). Observe that given the values of $V(s,\cdot,\cdot)$, $s \le T$, finding $L(V)(T,\vec\pi,a)$ involves just a deterministic optimization over $t$.

In our numerical computations below we discretize the interval $[0,T]$ and find the deterministic infimum over $t$ in (3.27). We also discretize the domain $D$ using a rectangular multi-dimensional grid and use linear interpolation to evaluate the jump operator $S_i$ of (3.7). Because the algorithm proceeds forward in time with $t = 0, \Delta t, \ldots, T$, for a given time-step $t = m\Delta t$ the right-hand side in (3.27) is known and we obtain $U(m\Delta t,\vec\pi,a)$ directly. The sequential approximation in (3.4) is on the other hand useful for numerically implementing infinite horizon
problems.
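The deterministic optimization over $t$ in (3.27) can be sketched on a time grid as follows. This is a minimal illustration, assuming the two ingredients of (3.27) have already been tabulated on the grid $t_j = j\,\Delta t$ for one fixed $(T,\vec\pi,a)$; the array names are hypothetical, and the trapezoid rule is one possible quadrature, not necessarily the one used by the authors.

```python
def first_jump_value(stop_value, running_cost, dt):
    """Approximate the infimum over t in (3.27) on the grid t_j = j * dt.

    stop_value[j]   : tabulated value of the first term in (3.27) at t_j,
                      i.e. (sum_i m_i) * exp(-rho * t_j) * MV(...).
    running_cost[j] : tabulated integrand of the second term at u = t_j.
    Returns (value, j_star), where t_{j_star} is the minimizing grid time.
    """
    best, j_star = stop_value[0], 0  # t = 0: act immediately, no running cost
    integral = 0.0
    for j in range(1, len(stop_value)):
        # Trapezoid rule for the integral of the running cost over [0, t_j].
        integral += 0.5 * dt * (running_cost[j - 1] + running_cost[j])
        candidate = stop_value[j] + integral
        if candidate < best:
            best, j_star = candidate, j
    return best, j_star


if __name__ == "__main__":
    # Illustrative numbers only.
    print(first_jump_value([5.0, 3.0, 2.0, 4.0], [1.0, 1.0, 1.0, 1.0], dt=0.5))
```

A minimizing index of 0 corresponds to ordering immediately, while the last index corresponds to waiting until the end of the horizon unless a demand arrives first.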

4. Numerical Illustrations 

We now present two numerical examples that highlight the structure and various features of our model. These examples were obtained by implementing the algorithm described in the last paragraph of Section 3.3.

4.1. Basic Illustration. Our first example is based on the computational analysis in [30]. The model in that paper is stated in discrete time; here we present a continuous-time analogue. Assume that the world can be in three possible states, $M_t \in E \triangleq \{\text{High}, \text{Medium}, \text{Low}\}$. The corresponding demand distributions are truncated negative binomial with maximum size $R = 18$ and

$$\nu_1 = \mathrm{NegBin}(100, 0.99), \qquad \nu_2 = \mathrm{NegBin}(900, 0.99), \qquad \nu_3 = \mathrm{NegBin}(1600, 0.99).$$

This means that the expected demand sizes/standard deviations are (1, 1), (9, 3) and (16, 4) respectively.
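The quoted means and standard deviations can be checked with a few lines of arithmetic. The sketch below assumes the common parameterization in which $\mathrm{NegBin}(r,p)$ counts failures before the $r$-th success with success probability $p$, so that the mean is $r(1-p)/p$ and the variance is $r(1-p)/p^2$; the paper does not spell out its convention, and the truncation at $R = 18$ is ignored here.

```python
import math


def negbin_mean_sd(r, p):
    """Mean and standard deviation of an untruncated NegBin(r, p).

    Assumed convention: number of failures before the r-th success,
    success probability p.
    """
    mean = r * (1.0 - p) / p
    sd = math.sqrt(r * (1.0 - p)) / p
    return mean, sd


if __name__ == "__main__":
    for r in (100, 900, 1600):
        mean, sd = negbin_mean_sd(r, 0.99)
        print(r, round(mean, 2), round(sd, 2))
```

Under this convention the three pairs are approximately (1.01, 1.01), (9.09, 3.03) and (16.16, 4.04), in line with the rounded values in the text.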

The generator of $M$ is taken to be

$$Q = \begin{pmatrix} -0.8 & 0.4 & 0.4 \\ 0.4 & -0.8 & 0.4 \\ 0.4 & 0.4 & -0.8 \end{pmatrix},$$

so that $M$ moves symmetrically and chaotically between its three states. The horizon is $T = 5$ with no discounting. Finally, the costs are

$$c(a) = a, \qquad K(a) = 2a, \qquad h = 0, \qquad \zeta = 0,$$

so that there are zero procurement/ordering costs and linear storage/stock-out costs. With zero ordering costs, the controller must consider the trade-off between understocking and overstocking. Since excess inventory cannot be disposed of (and there is no terminal salvage value), overstocking leads to higher future storage costs; these are increasing in the horizon, as the demand may be low and the stock will be carried forward for a long period of time. On the other hand, understocking is penalized by the stock-out penalty $K$. The probability of a stock-out is highly sensitive to the demand distribution, so that the cost of understocking is intricately tied to the current belief $\vec\Pi$. Summarizing, as the horizon increases, the optimal level of stock decreases, as the relative cost of overstocking grows. Thus, as the horizon approaches 0 the controller stocks up (since that is free to do) in order to minimize possible stock-outs. Overall, we obtain a time- and belief-dependent basestock policy as in [30].
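To get a feel for the time scale of the regime process in this example, one can compute the mean holding time and the transition probabilities $e^{Qt}$ of the generator above. The sketch below uses a truncated power series for the matrix exponential in plain Python; the helper names are hypothetical, and for this symmetric generator the result can be compared with the closed form $p_{ii}(t) = \tfrac13 + \tfrac23 e^{-1.2t}$, $p_{ij}(t) = \tfrac13 - \tfrac13 e^{-1.2t}$.

```python
import math

Q = [[-0.8, 0.4, 0.4],
     [0.4, -0.8, 0.4],
     [0.4, 0.4, -0.8]]


def mat_mul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=60):
    """Truncated power series sum_{k < terms} (q t)^k / k! for exp(q t)."""
    n = len(q)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = mat_mul(term, [[x * t / k for x in row] for row in q])
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


mean_holding_time = -1.0 / Q[0][0]  # expected sojourn time in each state

if __name__ == "__main__":
    print(mean_holding_time)
    for row in transition_matrix(Q, 1.0):
        print([round(x, 4) for x in row])
```

With a mean holding time of 1.25 and a horizon of $T = 5$, several regime changes are expected over the planning period.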

Figure 2 illustrates these phenomena as we vary the relative stock-out costs $K(a)$ and the remaining horizon. We show four panels where horizontally the horizon changes and vertically the stock-out penalty $K$ changes. We observe that $K$ has a dramatic effect on the optimal inventory level (note that in this example ordering costs are zero, so the optimal policy is only driven by $K$ and $c$). Also note that the region where the optimal policy is $d^*(T,\vec\pi) = 1$ is disjoint in the top two panels.

Figure 2. Optimal inventory levels for different time horizons for Section 4.1. We plot the regions of constancy for $d^*(T,\vec\pi) = \arg\min_a U(T,\vec\pi,a)$, which is the optimal inventory level to maintain given beliefs $\vec\pi$ (since ordering costs are zero), $\vec\pi\in D = \{\pi_1+\pi_2+\pi_3 = 1\}$. Top panels: $K(a) = 2a$; bottom panels: $K(a) = 4a$; left panels: $T = 1$; right panels: $T = 5$.

Figure 3 again follows [30] and shows the effect of different time dynamics of the core process $M$. In the first case, we assume that demands are expected to increase over time, so that the transition of $M$ follows the phases $1 \to 2 \to 3$. In that case, it is possible that inventory will be increased even without any new events (i.e. $\tau_k \ne \sigma_\ell$). This happens because the passage of time implies that the conditional probability $\Pi_3(t) = \mathbb{P}(M_t = 3 \mid \mathcal{F}_t)$ increases, and to counteract the corresponding increase in the probability of a stock-out, new inventory might be ordered. In the second case, we assume that demand will be decreasing over time. In that case, the controller will order less compared to the base case, since the chances of overstocking are increased.



4.2. Example with Censoring. In our second example we consider a model that treats censored 
observations. We assume that excess demand above available stock is unobserved, and that a 
corresponding opportunity cost is incurred in case of stock-out.

Figure 3. Optimal inventory levels for different core process dynamics in the [30] (TS02) example of Section 4.1. The blue surfaces show $U(T,\vec\pi,3)$ over the triangle $\vec\pi\in D = \{\pi_1+\pi_2+\pi_3 = 1\}$; underneath we show the optimal inventory levels $d^*(T,\vec\pi)$, see Figure 2. Left panel: case US in TS02; right panel: case DS in TS02, each with its own generator $Q$ of $M$. (Note that in TS02 time is discrete. To be able to make a comparison we choose our generators to make the average holding time in each state equal to those of TS02.)

For parameters we choose

$$\vec\lambda = \begin{pmatrix} 2 \\ 1 \end{pmatrix}, \qquad \vec\nu = \begin{pmatrix} 0.5 & 0.4 & 0.1 \\ 0.1 & 0.3 & 0.6 \end{pmatrix},$$

so that demands are of size at most $R = 3$. Note that in regime 2, demands are less frequent but of larger size; also, to distinguish between regimes it is crucial to observe the full demand size. The horizon is $T = 3$ and costs are selected as

$$c(a) = 2a, \qquad K(a) = 3.2a, \qquad h = 1.25, \qquad \zeta = 1,$$

with $\bar P = R = 3$. Again, we consider zero salvage value. These parameters have been specially chosen to emphasize the effect of censoring.

We find that the effect of censoring on the value function $U$ is on the order of 3-4% in this example, see Table 2 below. However, this obscures the fact that the optimal policies are dramatically different in the two cases. Figure 4 compares the two optimal policies given that current inventory is empty. In general, as might be expected, censored observations cause the manager to carry extra inventory in order to obtain maximum information. However, counter-intuitively, there are also values of $T$ and $\vec\pi$ where censoring can lead to lower inventory (compared to no censoring). We have observed situations where censoring increases inventory costs by up to 15%, which highlights the need to properly model that feature (the particular example was included to showcase other features we observe below).

Figure 4. Optimal order levels as a function of time and beliefs given current empty inventory, $P(0) = 0$. We plot $d_{\mathcal{M}U}(T,\vec\pi,0)$ as a function of time to maturity $T$ on the x-axis and initial beliefs $\Pi_1(0) = \mathbb{P}(M_0 = 1)$ on the y-axis.

4.3. Optimal Strategy Implementation. In Figure 5 we present a sample path of the $(\vec\Pi, P)$-process which shows the implementation of the optimal policy as defined in Figure 4. We consider the above setting with censored observations, $T = 3$, initial $\Pi_1(0) = 0.6$ (since $\Pi_2(t) = 1-\Pi_1(t)$ in this one-dimensional example, we focus on just the first component $\Pi_1(t) = \mathbb{P}(M_t = 1)$) and initial zero inventory, $P(0) = 0$. In this example, it is optimal for the manager to place new orders only when inventory is completely exhausted. Thus, $d_{\mathcal{M}U}(T,\vec\pi,a)$ is non-trivial only for $a = 0$; otherwise we have $d_{\mathcal{M}U}(T,\vec\pi,a) = a$ and the manager should just wait.

Since $d_{\mathcal{M}U}(3,\vec\Pi(0),0) = 3$, it is optimal for the manager to immediately place an order for three units, as indicated by an arrow on the y-axis in Figure 5. Then the manager waits for demand orders, in the meantime paying storage costs on the three units in inventory. At time $\sigma_1$, the first demand order (in this case of size two) arrives. This results in the update of the beliefs according to Proposition 2.1 as

$$\Pi_1(\sigma_1) = \frac{\lambda_1\cdot\nu_1(2)\cdot\Pi_1(\sigma_1-)}{\sum_{i=1}^{2}\lambda_i\cdot\nu_i(2)\cdot\Pi_i(\sigma_1-)} = \frac{2\cdot 0.4\cdot\Pi_1(\sigma_1-)}{2\cdot 0.4\cdot\Pi_1(\sigma_1-) + 1\cdot 0.3\cdot(1-\Pi_1(\sigma_1-))}.$$

This demand is fully observed and filled; since $d_{\mathcal{M}U}(3-\sigma_1,\vec\Pi_{\sigma_1},1) = 1$, no new orders are placed at that time. Then at time $\sigma_2$ we assume that a censored demand (i.e. a demand of size more than 1) arrives. This time the update in the beliefs is

$$\Pi_1(\sigma_2) = \frac{\lambda_1\cdot[\nu_1(2)+\nu_1(3)]\cdot\Pi_1(\sigma_2-)}{\sum_{i=1}^{2}\lambda_i\cdot[\nu_i(2)+\nu_i(3)]\cdot\Pi_i(\sigma_2-)} = \frac{2\cdot 0.5\cdot\Pi_1(\sigma_2-)}{2\cdot 0.5\cdot\Pi_1(\sigma_2-) + 1\cdot 0.9\cdot(1-\Pi_1(\sigma_2-))}.$$

We see that the censored observation gives very little new information to the manager and $\Pi_1(\sigma_2)$ is close to $\Pi_1(\sigma_2-)$. The inventory is now instantaneously brought down to zero, as the one remaining unit is shipped out (and the rest is assigned an expected lost opportunity cost). Because $d_{\mathcal{M}U}(T-\sigma_2,\vec\Pi(\sigma_2),0) = 1$, the manager immediately orders one new unit of inventory, $\tau_1 = \sigma_2$. Thus, overall we end up with $P(\sigma_2) = 1$. At time $\sigma_3$, a single unit demand (uncensored) is observed; this strongly suggests that $M_{\sigma_3} = 1$ (due to the short time between orders and the small order amount), and $\Pi_1(\sigma_3)$ is indeed large. Given that little time remains till maturity, it is now optimal to place no more new orders, $d_{\mathcal{M}U}(T-\sigma_3,\vec\Pi(\sigma_3),0) = 0$. However, as time elapses, $\mathbb{P}(M_t = 2)$ grows and the manager begins to worry about incurring excessive stock-out costs if a large order arrives. Accordingly, at time $\tau_2 \approx 2.19$ (and without new incoming orders) the manager places an order for a new unit of inventory as $(\vec\Pi,P)$ again enters the region where $d_{\mathcal{M}U} = 1$ (see the lowermost panel of Figure 5). As it turns out in this sample, no new orders are in fact forthcoming until $T$ and the manager will lose the latter inventory as there are no salvage opportunities.

Figure 5. Implementation of optimal strategy on a sample path. We plot the beliefs $\Pi_1(t) = \mathbb{P}(M_t = 1)$ as a function of calendar time $t$, over the different panels corresponding to $P(t)$. Incoming demand orders and placed supply orders are indicated with circles and diamonds respectively. Here $T = 3$ and $\vec\Pi(0) = [0.6, 0.4]$. The arrival times are $\sigma_1 = 1.7$, $\sigma_2 = 1.83$, $\sigma_3 = 1.87$ and the corresponding marks are $Z_1 = (2,0)$, $Z_2 = (1,\Delta)$, $Z_3 = (1,0)$, see Section 2.4.

Model                                   Value Function U(T, π, 0)   Optimal Policy d_MU(T, π, 0)
Base case w/out censoring               25.21                       2
Base case w/ censoring                  25.97                       3
Reduced stockout costs K(a) = 2a        16.37
Zero storage costs c = 0                16.71                       3
No fixed ordering costs                 22.92                       1
Terminal salvage value of 50%           24.75                       2
Can buy/sell at cost, ζ = 1             24.30                       2

Table 2. Results for the comparative statics in Section 4.4. Here $T = 3$ and $\vec\Pi(0) = \vec\pi = (0.5, 0.5)$, so that $\mathbb{P}(M_0 = 1) = \mathbb{P}(M_0 = 2) = \tfrac12$.
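The two belief updates displayed in Section 4.3 are instances of a single Bayes step in which the likelihood of a censored demand is the tail mass of the size distribution. The sketch below is a minimal illustration for the two-regime example; the constants are read off from the displayed updates ($\nu_1(2) = 0.4$, $\nu_1(2)+\nu_1(3) = 0.5$, $\nu_2(2) = 0.3$, $\nu_2(2)+\nu_2(3) = 0.9$, $\lambda = (2,1)$), the function name is hypothetical, and a demand is treated as censored exactly when it exceeds the inventory on hand.

```python
LAMBDA = [2.0, 1.0]                  # arrival intensities per regime
NU = [[0.5, 0.4, 0.1],               # NU[i][y - 1] = nu_i(y), y = 1, 2, 3
      [0.1, 0.3, 0.6]]


def update_belief(pi, demand, inventory):
    """Posterior regime probabilities right after a demand arrival.

    pi        : prior probabilities just before the arrival.
    demand    : true demand size (1, 2 or 3).
    inventory : units on hand just before the arrival; if demand exceeds it,
                only the event {demand > inventory} is observed.
    """
    weights = []
    for i, prior in enumerate(pi):
        if demand <= inventory:
            likelihood = NU[i][demand - 1]       # fully observed size
        else:
            likelihood = sum(NU[i][inventory:])  # censored: P(Y > inventory)
        weights.append(LAMBDA[i] * likelihood * prior)
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    print(update_belief([0.6, 0.4], demand=2, inventory=3))  # uncensored
    print(update_belief([0.5, 0.5], demand=2, inventory=1))  # censored
```

In the censored case the likelihood ratio between the regimes is $2\cdot 0.5$ against $1\cdot 0.9$, which is close to one, so the posterior barely moves from the prior.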



4.4. Effect of Other Parameters. We now compare the effects of other model parameters and ingredients on the value function and optimal strategy. To give a concise summary of our findings, Table 2 shows the initial value function and initial optimal policy for one fixed choice of beliefs $\vec\Pi(0)$ and time horizon $T$.

In particular, we compare the effect of censoring, as well as changes in various costs, on the value function and a representative optimal policy choice. We also study the effect of salvage opportunities and the possibility of selling inventory. Salvaging at $x\%$ means that one adds the initial condition $U(0,\vec\pi,a) = -x\cdot h\cdot a$ to (2.5), so that at the terminal date the manager can recover some of the unit costs associated with unsold inventory. Selling inventory means that at any point in time the manager can sell back unneeded inventory, so that $\xi_k \in \{-P(\tau_k),\ldots,\bar P - P(\tau_k)\}$ and the minimization in (3.13) is over all $b\in[0,\bar P]$.

As expected, storage costs increase average costs and cause the manager to carry less inventory; conversely, stock-out costs cause the manager to carry more inventory (and also increase the value function). Fixed order costs are also crucial and increase supply order sizes, as each supply order is very costly to place. We find that in this example, the possibility of salvaging inventory and the opportunity to sell inventory have little impact on the value function, which appears to be primarily driven by potential stock-out penalties.



5. Conclusion 

In this paper we have presented a probabilistic model that can address the setting of partially observed non-stationary demand in inventory management. Applying the DP intuition, we have derived, and given a direct proof of, the dynamic programming equation for the value function and the ensuing characterization of an optimal policy. We have also derived an alternative characterization that can be easily implemented on a computer. This gives a method to compute the full value function/optimal policy to any degree of accuracy. As such, our model contrasts with other approaches that only present heuristic policy choices.

Our model can also incorporate demand censoring. Our numerical investigations suggest that censored demands may have a significant influence on the optimal value and an even more dramatic impact on the manager's optimal policies. This highlights the need to properly model demand censoring in applications. It would be of interest to further study this aspect of the model and to compare it with real-life experiences of inventory managers.

Appendix A. Proof of Proposition 2.1
A.1. A Preliminary Result.
Lemma A.l. For i G E, let us define 

(A.l) Lf a >t(t,r:(t k ,z k ),k<r)±E*' a >t 



l {Mt =i}-e- m -Y[e(t k} z k ) 



k=l 



where 
(A.2) 



*{t, z)=Yl l iMt=i) x i ■ 9j{% p t-)- 



Moreover let L n,a '^(t,r : (t k ,z k ),k < r) — J2jeE L] ,a (t, r : (t k ,z k ),k < r). Then we have 



. . . . Lf' a ' c (t, N t : (a k , Z k ),k< N t 
(A.3) Ui(t) - 1 v ' v ' ; ' ~ 



(t,r : (t k ,z k ),k < r) 
L**t(t,r : (t k ,z k ),k<r) 



r — Nt ; {tk—GkA—Z^kKr 



L*(t,N t :(a k ,Z k ),k<N t ) 
P -a.s., for all t > 0, and for i G E. 
Proof. Let H be a set of the form 

E = {N tl =r 1 ,...,N tk =r k ;(Z 1 ,...,Z rh )eB] 

where = t < t\ < ■ ■ ■ < t k — t with < r x < . . . < r k for k G N, and B is a Borel set in B(W k ). 
Since tj and r^-'s are arbitrary, to prove (A.3) by the Monotone Class Theorem it is then sufficient 
to establish 



E ff,a,€ |- lg . P ^{ Mt = i\J^}] = 



h 



Lf a ^(t,N t :(a k ,Z k ),k<N; 



" L^(t,N t :(a k ,Z k ),k<N t ) 
Conditioning on the path of M, the left-hand side (LHS) above equals 



LHS = ¥?> a 't 



l {Mt =i}^ {N h = n ,..., N tk = r k ; (Z u . . . , Z r J € B 



M s , s <t 



{M t =i\ 



BxT(t u ...,t k ) 



w \ a\ G dsx, . . . , a r . G ds rk ; Z x = z\, . . . , Z n 



M s , s <t 
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where 



T(ti, . . . , t k ) = {s u . . . , s rk e IR+ : Si < . . . < s rk < t and s rj < tj < s r . +1 for j = 1, . . . k) 



Then, using (2.10) and Fubini's theorem we obtain 



LHS = E 7 ^ 



l{Mt=i} / 

BxT(ti,...,t fc ) 



Bxr(t u ...,t k ) 



■Ik 



l-k 



l{M t =i>e m Yl l {Ms e =j}h9j(^ Psi) 

rk 

Li' a '*(t,r k : (sj,Zj),j < r k )Y[ds e 

i 

L*> a 't(t,r k : {sj,Zj),3 < r k ) \\ds f 



Bxr(t u ...,tk) e=1 

Li' a *(t,r k : (s j} Zj),j <r k ) - 



BxT( tl ,...,t k ) L^(t,r k : (a.,-, Y}), j < r k ) 



=1 



• E w 



^ l{A/ t =i}e \\ 1 {M. t =j}^j9j{zi\ P H ) 

JEE £=1 j£E 



n dst 



By another application of Fubini's theorem, we obtain LHS 



kM*- 



-i} 



i£E 



Li' a, H^ r k ■ (sj,Zj),j < r k ) 



rk 



Tk 



BxT( tl) ...A) L*> a >Z(t,r k : (a,. :})../ < r fe ) , | ^ 



■ e /(<) n J2 i {Ms e =i}^9j(ze; p. t ) ■ n 



E ff ' a ' € 



ME 



HN tl =ri,...,N tk = rM ,...,Z rk )zB} ■ fratfaty . faZ&j < N t ) 



M s ; s <t 



HN tl =r 1 ,...,N tk =rk-,(Z 1 ,...,Z rk )eB} ' ^ ^ . (^^j < ^ 



* L-^Nt-.^Z^jKN,) 



□ 
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A. 2. Proof of Proposition 2.1, Let [•] denote the expectation operator E^' '^- |M = j], 
and let t r < t < t + u, then 

(A.4) Lf°"\t + u,r: (t k , z k ),k<r) = J2 *i ' E?* 



l { M t+u=4} -e-^-n^,4) 



fc=i 



E 



k=l 



j&E 



e- 1 ® (l[e(t k ,z k ) ]E a / 



\k=l 



L{M t+ „=i} • 



M S ,S <t 



-(/(*+«)-/(*)) 



where the third equality followed from the properties of the conditional expectation and the Markov 
property of M. The last expression in (A.4) can be written as 

= E^ E ? 



Z6£ 



fc=i 



l£E 



Then the explicit form of II in (A. 3) implies that for a m < t < t + u < <J m +ii we have 



Ui(t + u) 



Zi £E Lf(t, r : (a k , z k ),k<r)- Ef [l {Mu=t} ■ e^} 
EjeEEieELfi^r : {a k ,z k ),k < r) -Ef [l {Mu=j} ■ er'M] 



(A.5) 



E^E^n,(t) -E^ [l {Mu=j} • e-W] E^E* [l{M tt =i} • e"'(«)] 
E flt [l {Mi=i} ■ e- 7 (")] _ F*{<ti >u,M u = i} 



EHt f e -^(«) 



7f=n t 



On the other hand, the expression in (A.l) implies 
(A.6) 



Lf Af (<r r+1 , r+1 : Z fc ), A; < r+1) = E^ 



r+1 



Vt=Q • e" 7(i) ■ JJ £(t fc , lib) 



fc=l 



t— ""r + 1 ; {tk— a k:yk—Zk)k<r+l 



\ igi {Z r+1 ;P ar+1 _)-E*> a >S 



k=l 



t—a r+ i ; (t k —CTk,Zk—Z k )k<r 
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Note that for fixed time t, we have M t = M t -, P^'^-a.s. and Lj ,a ' xl (t,r : (t k ,z k ),k < r) = 
L^ ,a ^(t—, r : (t k , z k ),k < r) when t r < t. Then we obtain 

(A.7) 

Lj' a >Z(cr r+1 , r + 1 : (a k , Z k ), k < r + 1) = Ajft(2" r+1 ; P CTr+1 -) • L^ a ' $ ((7 r+1 -, r : (a k , Z k ),k<r), 



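due to (A.6). Hence we conclude that at the arrival times $\sigma_1, \sigma_2, \ldots$ of $Z$, the process $\vec\Pi$ exhibits a jump behavior and satisfies the recursive relation

$$\Pi_i(\sigma_{r+1}) = \frac{\lambda_i\,g_i(Z_{r+1};P_{\sigma_{r+1}-})\,L_i^{\vec\pi,a,\xi}\big(\sigma_{r+1}-,\,r : (\sigma_k,Z_k),\,k\le r\big)}{\sum_{j\in E}\lambda_j\,g_j(Z_{r+1};P_{\sigma_{r+1}-})\,L_j^{\vec\pi,a,\xi}\big(\sigma_{r+1}-,\,r : (\sigma_k,Z_k),\,k\le r\big)} = \frac{\lambda_i\,g_i(Z_{r+1};P_{\sigma_{r+1}-})\,\Pi_i(\sigma_{r+1}-)}{\sum_{j\in E}\lambda_j\,g_j(Z_{r+1};P_{\sigma_{r+1}-})\,\Pi_j(\sigma_{r+1}-)}$$

for $r\in\mathbb{N}$.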



Appendix B. Analysis Leading to the Proof of Proposition 3.2

B.1. Preliminaries. Let us consider the following restricted version of (2.5)

$$U_n(T,\vec\pi,a) = \inf_{\xi\in\mathcal{U}_n(T)} J^{\xi}(T,\vec\pi,a), \qquad n \ge 1, \tag{B.1}$$

in which $\mathcal{U}_n(T)$ is the subset of $\mathcal{U}(T)$ which contains strategies with at most $n \ge 1$ supply orders up to time $T$.

The following proposition shows that the value functions $(U_n)_{n\in\mathbb{N}}$ of (B.1), which correspond to the restricted control problems over $\mathcal{U}_n(T)$, can be alternatively obtained via the sequence of iterated optimal stopping problems in (3.16).

Proposition B.1. $U_n = V_n$ for $n\in\mathbb{N}$.

Proof. By definition we have that $U_0 = V_0$. Let us assume that $U_n = V_n$ and show that $U_{n+1} = V_{n+1}$. We will carry out the proof in two steps.

Step 1. First we will show that $U_{n+1} \ge V_{n+1}$. Let $\xi\in\mathcal{U}_{n+1}(T)$,

n+l 



k=0 



with r = and r ra+1 = T, be e-optimal for the problem in ( |B.1[ ), i.e., 
(B.2) 



U n+1 (T, if, a) + e> J ? (T, if, 



a . 
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Let £ G U n {T) be defined as = r^+i , = for k G N+. Using the strong Markov property 
of (IT, P), we can write as 



(B-3) 



jt(T, 7T, a) = [ / e- ps c(P s ) rfs + ^ e-^i w < Tl} if((y, - P fft . 



> E 71 



o 

+ (j*{t - n, n n , P T1 _ + CO + h ■ a + C 

n e-" s c(P s ) cfe + ^e- CT n w < Tl} K((F, - P c 



[V n (T - n , n T1 , P T1 _ + 6) + h ■ 6 + C 



> E 71 



1 e-" s c(P s ) ds + ^e- <T a w < ri} K((F, - P CTr 



e^A^T-r^n^P^ 
> GV n (T, if, a) = V n+1 (T, 7f, a). 



Here, the first inequality follows from the induction hypothesis, the second inequality follows from the definition of $\mathcal{M}$, and the last inequality from the definition of $\mathcal{G}$. As a result of (B.2) and (B.3) we have that $U_{n+1} \ge V_{n+1}$, since $\epsilon > 0$ is arbitrary.

Step 2. To show the opposite inequality $U_{n+1} \le V_{n+1}$, we will construct a special $\xi\in\mathcal{U}_{n+1}(T)$. To this end let us introduce



(B.4) 



t x = mf{t > : MV n (T - t, n t , P t ) < V n+l {T - t, U t , P t ) + e}, 
£i = g?mv„ (T-ti, Ilr! , Pf!- ) • 



Let £ t = X]fc=o^ fc ' %fc»T)t+i)(*)) £ e Mn(T) be e-optimal for the problem in which n interventions 
are allowed, i.e. (B.l). Using £ we now complete the description of the control £ G U n+ i(T) by 



assigning, 



(B.5) 



in which $\theta$ is the classical shift operator used in the theory of Markov processes.

Note that $\tau_1$ is an $\epsilon$-optimal stopping time for the stopping problem in the definition of $\mathcal{G}V_n$. This follows from the classical optimal stopping theory since the process $(\vec\Pi, P)$ has the strong Markov property.
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V n+1 (T, 7r,a)+e> W A \ / e-" s c{P s ) ds + Y^ ^l Wt < Tl] K{{Y, - P^ 



(B.6) 



+ e- p ^MV n (T-r 1 ,U Tl ,P 7l . 



> E 71 



e-» s c(P s ) ds + J2 e- CT n w < Tl} K((F, - P ut J) + ) 



+ e-^ 1 U n (T - ri, U Tl , iV + f i) + h ■ ^ + ( 



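in which the second inequality follows from the definition of $\bar{\xi}_1$ and the induction hypothesis. It follows from (B.6) and the strong Markov property of $(\vec{\Pi}, P)$ that

$$\begin{aligned}
V_{n+1}(T, \vec{\pi}, a) + 2\varepsilon &\ge \mathbb{E}^{\vec{\pi},a}\Big[ \int_0^{\bar{\tau}_1} e^{-\rho s} c(P_s)\, ds + \sum_{\ell} e^{-\rho \sigma_\ell} \mathbf{1}_{\{\sigma_\ell \le \bar{\tau}_1\}} K\big((Y_\ell - P_{\sigma_\ell -})_+\big) \\
&\qquad\qquad + e^{-\rho \bar{\tau}_1} \big( U_n(T - \bar{\tau}_1, \vec{\Pi}_{\bar{\tau}_1}, P_{\bar{\tau}_1 -} + \bar{\xi}_1) + \varepsilon + h \cdot \bar{\xi}_1 + \zeta \big) \Big] \\
&\ge \mathbb{E}^{\vec{\pi},a}\Big[ \int_0^{\bar{\tau}_1} e^{-\rho s} c(P_s)\, ds + \sum_{\ell} e^{-\rho \sigma_\ell} \mathbf{1}_{\{\sigma_\ell \le \bar{\tau}_1\}} K\big((Y_\ell - P_{\sigma_\ell -})_+\big) \\
&\qquad\qquad + e^{-\rho \bar{\tau}_1} \big( J^{\hat{\xi}}(T - \bar{\tau}_1, \vec{\Pi}_{\bar{\tau}_1}, P_{\bar{\tau}_1 -} + \bar{\xi}_1) + h \cdot \bar{\xi}_1 + \zeta \big) \Big] \\
&= J^{\bar{\xi}}(T, \vec{\pi}, a) \ge U_{n+1}(T, \vec{\pi}, a).
\end{aligned} \tag{B.7}$$

This completes the proof of the second step since $\varepsilon > 0$ is arbitrary.

□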
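As a sanity check on the identity $U_n = V_n$ established above, the following Python sketch compares the two quantities in a deliberately simplified setting. It is not the model of this paper: time is discrete, demand is a deterministic single unit per period, there is no hidden regime, and every name and parameter value (`BETA`, `H`, `ZETA`, `K_PEN`, `P_MAX`, the storage rate 0.2) is an illustrative assumption. The function `U` minimizes by brute force over explicit order schedules with at most $n$ orders, while `G` is a toy analogue of the operator $\mathcal{G}$, iterated from the no-order cost.

```python
from itertools import product

# Illustrative parameters (assumptions, not taken from the paper)
BETA, H, ZETA, K_PEN, P_MAX, DEMAND = 0.9, 1.0, 0.5, 4.0, 3, 1

def step(a):
    """One period with no order: (storage + stock-out cost, next inventory)."""
    return 0.2 * a + K_PEN * max(DEMAND - a, 0), max(a - DEMAND, 0)

def schedule_cost(a, orders):
    """Discounted cost of an explicit schedule; orders[t] = 0 means no order."""
    total, disc = 0.0, 1.0
    for xi in orders:
        if xi > 0:
            if a + xi > P_MAX:
                return float("inf")          # capacity violated: inadmissible
            total += disc * (H * xi + ZETA)  # proportional plus fixed order cost
            a += xi
        cost, a = step(a)
        total += disc * cost
        disc *= BETA
    return total

def U(n, T, a):
    """Brute force: best schedule over T periods with at most n orders."""
    return min(schedule_cost(a, o)
               for o in product(range(P_MAX + 1), repeat=T)
               if sum(x > 0 for x in o) <= n)

def G(w, T_max):
    """Toy analogue of G: wait until tau, order once, then continue with w."""
    new = {}
    for T in range(T_max + 1):
        for a in range(P_MAX + 1):
            best, run, disc, x = float("inf"), 0.0, 1.0, a
            for tau in range(T):
                for xi in range(1, P_MAX - x + 1):   # toy intervention operator
                    cand = run + disc * (H * xi + ZETA + w[(T - tau, x + xi)])
                    best = min(best, cand)
                cost, x = step(x)
                run += disc * cost
                disc *= BETA
            new[(T, a)] = min(best, run)             # run = cost of never ordering
    return new

T_MAX = 4
V = [{(T, a): schedule_cost(a, (0,) * T)
      for T in range(T_MAX + 1) for a in range(P_MAX + 1)}]   # V_0: no orders
for n in range(3):
    V.append(G(V[-1], T_MAX))                                  # V_{n+1} = G V_n

max_gap = max(abs(U(n, T_MAX, a) - V[n][(T_MAX, a)])
              for n in range(4) for a in range(P_MAX + 1))
print("largest |U_n - V_n| over the grid:", max_gap)
```

In this toy problem two orders placed in the same period are always dominated by a single combined order, because the fixed cost `ZETA` is positive. That is why the brute-force enumeration with one order per period and the operator iteration agree up to rounding.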



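B.2. Proof of Proposition 3.4. Let us denote $V(T, \vec{\pi}, a) = \lim_{n \to \infty} V_n(T, \vec{\pi}, a)$, which is well-defined thanks to the monotonicity of $(V_n)_{n \in \mathbb{N}}$. Since $\mathcal{U}_n(T) \subset \mathcal{U}(T)$, it follows that $V_n(T, \vec{\pi}, a) = U_n(T, \vec{\pi}, a) \ge U(T, \vec{\pi}, a)$. Therefore $V(T, \vec{\pi}, a) \ge U(T, \vec{\pi}, a)$. In the remainder of the proof we will show that $V(T, \vec{\pi}, a) \le U(T, \vec{\pi}, a)$.

Let $\xi \in \mathcal{U}(T)$ and $\tilde{\xi}_t = \xi_{t \wedge \tau_n}$. Observe that $\tilde{\xi} \in \mathcal{U}_n(T)$. Then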



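$$\begin{aligned}
\big| J^{\xi}(T, \vec{\pi}, a) - J^{\tilde{\xi}}(T, \vec{\pi}, a) \big| &\le \mathbb{E}^{\vec{\pi},a,\xi}\Big[ 2 c(\bar{P}) \int_{\tau_n}^{T} e^{-\rho s}\, ds + \sum_{k \ge n+1} e^{-\rho \tau_k} (h \cdot \xi_k + \zeta) \Big] \\
&\qquad + 2 K(R)\, \mathbb{E}^{\vec{\pi},a,\xi}\Big[ \sum_{\tau_n < \sigma_\ell < T} 1 \Big].
\end{aligned} \tag{B.8}$$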
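Now, the right-hand side of (B.8) converges to 0 as $n \to \infty$. Since there are only finitely many switches almost surely for any given path,

$$\lim_{n \to \infty} \Big[ \int_0^T \mathbf{1}_{\{s > \tau_n\}} e^{-\rho s}\, ds + \sum_{k \ge n+1} e^{-\rho \tau_k} (h \cdot \xi_k + \zeta) \Big] = 0.$$

The admissibility condition (2.6) along with the dominated convergence theorem implies that

$$\lim_{n \to \infty} \mathbb{E}^{\vec{\pi},a,\xi}\Big[ \int_{\tau_n}^{T} e^{-\rho s}\, ds + \sum_{k \ge n+1} e^{-\rho \tau_k} (h \cdot \xi_k + \zeta) \Big] = 0.$$

On the other hand,

$$\mathbb{E}^{\vec{\pi},a}\Big[ \sum_{\ell} \mathbf{1}_{\{\sigma_\ell \le T\}} \Big] = \mathbb{E}^{\vec{\pi},a}[N(T)] < \infty$$

(see e.g. the estimate in (3.20)). Therefore by the monotone convergence theorem

$$\lim_{n \to \infty} \mathbb{E}^{\vec{\pi},a,\xi}\Big[ \sum_{\tau_n < \sigma_\ell < T} 1 \Big] = 0.$$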

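As a result, for any $\varepsilon > 0$ and $n$ large enough, we find

$$\big| J^{\xi}(T, \vec{\pi}, a) - J^{\tilde{\xi}}(T, \vec{\pi}, a) \big| \le \varepsilon.$$

Now, since $\tilde{\xi} \in \mathcal{U}_n(T)$ we have $V_n(T, \vec{\pi}, a) = U_n(T, \vec{\pi}, a) \le J^{\tilde{\xi}}(T, \vec{\pi}, a) \le J^{\xi}(T, \vec{\pi}, a) + \varepsilon$ for sufficiently large $n$, and it follows that

$$V(T, \vec{\pi}, a) = \lim_{n \to \infty} V_n(T, \vec{\pi}, a) \le J^{\xi}(T, \vec{\pi}, a) + \varepsilon. \tag{B.9}$$

Since $\xi$ and $\varepsilon$ are arbitrary, we have the desired result.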



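B.3. Proof of Proposition 3.2. Step 1. First we will show that $U$ is a fixed point of $\mathcal{G}$. Since $V_n \ge U$, monotonicity of $\mathcal{G}$ implies that

$$\begin{aligned}
V_{n+1}(T, \vec{\pi}, a) &\ge \inf_{\tau \in \mathcal{S}(T)} \mathbb{E}^{\vec{\pi},a}\Big[ \int_0^{\tau} e^{-\rho s} c(P_s)\, ds + \sum_{k} e^{-\rho \sigma_k} \mathbf{1}_{\{\sigma_k \le \tau\}} K\big((Y_k - P_{\sigma_k -})_+\big) \\
&\qquad\qquad + e^{-\rho \tau} \mathcal{M} U(T - \tau, \vec{\Pi}_{\tau}, P_{\tau -}) \Big].
\end{aligned}$$

Taking the limit of the left-hand side with respect to $n$ we obtain

$$\begin{aligned}
U(T, \vec{\pi}, a) &\ge \inf_{\tau \in \mathcal{S}(T)} \mathbb{E}^{\vec{\pi},a}\Big[ \int_0^{\tau} e^{-\rho s} c(P_s)\, ds + \sum_{k} e^{-\rho \sigma_k} \mathbf{1}_{\{\sigma_k \le \tau\}} K\big((Y_k - P_{\sigma_k -})_+\big) \\
&\qquad\qquad + e^{-\rho \tau} \mathcal{M} U(T - \tau, \vec{\Pi}_{\tau}, P_{\tau -}) \Big] = \mathcal{G} U(T, \vec{\pi}, a).
\end{aligned}$$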
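Next, we will obtain the reverse inequality. Let $\hat{\tau} \in \mathcal{S}(T)$ be an $\varepsilon$-optimal stopping time for the optimal stopping problem in the definition of $\mathcal{G} U$, i.e.,

$$\begin{aligned}
&\mathbb{E}^{\vec{\pi},a}\Big[ \int_0^{\hat{\tau}} e^{-\rho s} c(P_s)\, ds + \sum_{k} e^{-\rho \sigma_k} \mathbf{1}_{\{\sigma_k \le \hat{\tau}\}} K\big((Y_k - P_{\sigma_k -})_+\big) + e^{-\rho \hat{\tau}} \mathcal{M} U(T - \hat{\tau}, \vec{\Pi}_{\hat{\tau}}, P_{\hat{\tau} -}) \Big] \\
&\qquad \le \mathcal{G} U(T, \vec{\pi}, a) + \varepsilon.
\end{aligned} \tag{B.10}$$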



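On the other hand, as a result of Proposition 3.4 and the monotone convergence theorem

$$\begin{aligned}
U(T, \vec{\pi}, a) &= \lim_{n \to \infty} V_n(T, \vec{\pi}, a) \\
&\le \lim_{n \to \infty} \mathbb{E}^{\vec{\pi},a}\Big[ \int_0^{\hat{\tau}} e^{-\rho s} c(P_s)\, ds + \sum_{k} e^{-\rho \sigma_k} \mathbf{1}_{\{\sigma_k \le \hat{\tau}\}} K\big((Y_k - P_{\sigma_k -})_+\big) \\
&\qquad\qquad + e^{-\rho \hat{\tau}} \mathcal{M} V_{n-1}(T - \hat{\tau}, \vec{\Pi}_{\hat{\tau}}, P_{\hat{\tau} -}) \Big] \\
&= \mathbb{E}^{\vec{\pi},a}\Big[ \int_0^{\hat{\tau}} e^{-\rho s} c(P_s)\, ds + \sum_{k} e^{-\rho \sigma_k} \mathbf{1}_{\{\sigma_k \le \hat{\tau}\}} K\big((Y_k - P_{\sigma_k -})_+\big) \\
&\qquad\qquad + e^{-\rho \hat{\tau}} \mathcal{M} U(T - \hat{\tau}, \vec{\Pi}_{\hat{\tau}}, P_{\hat{\tau} -}) \Big].
\end{aligned} \tag{B.11}$$

Now, (B.10) and (B.11) together yield the desired result since $\varepsilon$ is arbitrary.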

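Step 2. Let $\tilde{U}$ be another fixed point of $\mathcal{G}$ satisfying $\tilde{U} \le U_0 = V_0$. Then an induction argument shows that $\tilde{U} \le U$: Assume that $\tilde{U} \le V_n$. Then $\tilde{U} = \mathcal{G}\tilde{U} \le \mathcal{G} V_n = V_{n+1}$, by the monotonicity of $\mathcal{G}$. Therefore for all $n$, $\tilde{U} \le V_n$, which implies that $\tilde{U} \le \inf_n V_n = U$. □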
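The induction in Step 2 can be illustrated numerically on a much simpler operator than the $\mathcal{G}$ of this paper. The Python sketch below uses a hypothetical stopping operator on a five-state symmetric random walk absorbed at both end points, with an arbitrary cost vector `g` playing the role of $V_0$; none of these objects come from the paper. Iterating the operator from `g` produces a decreasing sequence whose limit is a fixed point, and any other fixed point below `g` (here the zero vector) stays below that limit.

```python
def G(w, g):
    """Toy stopping operator on states 0..m: either stop and pay g[i], or
    continue one step of a symmetric random walk absorbed at 0 and m."""
    m = len(w) - 1
    cont = [w[0]] + [(w[i - 1] + w[i + 1]) / 2 for i in range(1, m)] + [w[m]]
    return [min(gi, ci) for gi, ci in zip(g, cont)]

g = [0.0, 3.0, 3.0, 3.0, 2.0]    # hypothetical stopping cost, the analogue of V_0

iterates = [g]
for _ in range(200):             # V_{n+1} = G V_n
    iterates.append(G(iterates[-1], g))
limit = iterates[-1]

other = [0.0] * len(g)           # a second fixed point lying below g

print("limit of the iteration:", [round(x, 6) for x in limit])
print("zero vector is a fixed point:", G(other, g) == other)
print("zero vector lies below the limit:", all(o <= l for o, l in zip(other, limit)))
```

The operator is monotone in `w`, so the same argument as in Step 2 applies: a fixed point that starts below `g` remains below every iterate and hence below their limit.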